Search Results: "tore"

1 May 2024

Antoine Beaupr : Tor migrates from Gitolite/GitWeb to GitLab

Note: I've been awfully silent here for the past ... (checks notes) oh dear, 3 months! But that's not because I've been idle, quite the contrary, I've been very busy but just didn't have time to write about anything. So I've taken it upon myself to write something about my work this week, and published this post on the Tor blog which I copy here for a broader audience. Let me know if you like this or not.
Tor has finally completed a long migration from legacy Git infrastructure (Gitolite and GitWeb) to our self-hosted GitLab server. Git repository addresses have therefore changed. Many of you probably have made the switch already, but if not, you will need to change:
https://git.torproject.org/
to:
https://gitlab.torproject.org/
In your Git configuration. The GitWeb front page is now an archived listing of all the repositories before the migration. Inactive git repositories were archived in GitLab legacy/gitolite namespace and the gitweb.torproject.org and git.torproject.org web sites now redirect to GitLab. Best effort was made to reproduce the original gitolite repositories faithfully and also avoid duplicating too much data in the migration. But it's possible that some data present in Gitolite has not migrated to GitLab. User repositories are particularly at risk, because they were massively migrated, and they were "re-forked" from their upstreams, to avoid wasting disk space. If a user had a project with a matching name it was assumed to have the right data, which might be inaccurate. The two virtual machines responsible for the legacy service (cupani for git-rw.torproject.org and vineale for git.torproject.org and gitweb.torproject.org) have been shutdown. Their disks will remain for 3 months (until the end of July 2024) and their backups for another year after that (until the end of July 2025), after which point all the data from those hosts will be destroyed, with only the GitLab archives remaining. The rest of this article expands on how this was done and what kind of problems we faced during the migration.

Where is the code? Normally, nothing should be lost. All repositories in gitolite have been either explicitly migrated by their owners, forcibly migrated by the sysadmin team (TPA), or explicitly destroyed at their owner's request. An exhaustive rewrite map translates gitolite projects to GitLab projects. Some of those projects actually redirect to their parent in cases of empty repositories that were obvious forks. Destroyed repositories redirect to the GitLab front page. Because the migration happened progressively, it's technically possible that commits pushed to gitolite were lost after the migration. We took great care to avoid that scenario. First, we adopted a proposal (TPA-RFC-36) in June 2023 to announce the transition. Then, in March 2024, we locked down all repositories from any further changes. Around that time, only a handful of repositories had changes made after the adoption date, and we examined each repository carefully to make sure nothing was lost. Still, we built a diff of all the changes in the git references that archivists can peruse to check for data loss. It's large (6MiB+) because a lot of repositories were migrated before the mass migration and then kept evolving in GitLab. Many other repositories were rebuilt in GitLab from parent to rebuild a fork relationship which added extra references to those clones. A note to amateur archivists out there, it's probably too late for one last crawl now. The Git repositories now all redirect to GitLab and are effectively unavailable in their original form. That said, the GitWeb site was crawled into the Internet Archive in February 2024, so at least some copy of it is available in the Wayback Machine. At that point, however, many developers had already migrated their projects to GitLab, so the copies there were already possibly out of date compared with the repositories in GitLab. Software Heritage also has a copy of all repositories hosted on Gitolite since June 2023 and have continuously kept mirroring the repositories, where they will be kept hopefully in eternity. There's an issue where the main website can't find the repositories when you search for gitweb.torproject.org, instead search for git.torproject.org. In any case, if you believe data is missing, please do let us know by opening an issue with TPA.

Why? This is an old project in the making. The first discussion about migrating from gitolite to GitLab started in 2020 (almost 4 years ago). But going further back, the first GitLab experiment was in 2016, almost a decade ago. The current GitLab server dates from 2019, replacing Trac for issue tracking in 2020. It was originally supposed to host only mirrors for merge requests and issue trackers but, naturally, one thing led to another and eventually, GitLab had grown a container registry, continuous integration (CI) runners, GitLab Pages, and, of course, hosted most Git repositories. There were hesitations at moving to GitLab for code hosting. We had discussions about the increased attack surface and ways to mitigate that, but, ultimately, it seems the issues were not that serious and the community embraced GitLab. TPA actually migrated its most critical repositories out of shared hosting entirely, into specific servers (e.g. the Puppet Git repository is just on the Puppet server now), leveraging Git's decentralized nature and removing an entire attack surface from our infrastructure. Some of those repositories are mirrored back into GitLab, but the authoritative copy is not on GitLab. In any case, the proposal to migrate from Gitolite to GitLab was effectively just formalizing a fait accompli.

How to migrate from Gitolite / cgit to GitLab The progressive migration was a challenge. If you intend to migrate between hosting platforms, we strongly recommend to make a "flag day" during which you migrate all repositories at once. This ensures a smoother transition and avoids elaborate rewrite rules. When Gitolite access was shutdown, we had repositories on both GitLab and Gitolite, without a clear relationship between the two. A priori, the plan then was to import all the remaining Gitolite repositories into the legacy/gitolite namespace, but that seemed wasteful, particularly for large repositories like Tor Browser which uses nearly a gigabyte of disk space. So we took special care to avoid duplicating repositories. When the mass migration started, only 71 of the 538 Gitolite repositories were Migrated to GitLab in the gitolite.conf file. So, given that we had hundreds of repositories to migrate:, we developed some automation to "save time". We already automate similar ad-hoc tasks with Fabric, so we used that framework here as well. (Our normal configuration management tool is Puppet, which is a poor fit here.) So a relatively large amount of Python code was produced to basically do the following:
  1. check if all on-disk repositories are listed in gitolite.conf (and vice versa) and either add missing repositories or delete them from disk if garbage
  2. for each repository in gitolite.conf, if its category is marked Migrated to GitLab, skip, otherwise;
  3. find a matching GitLab project by name, prompt the user for multiple matches
  4. if a match is found, redirect if the repository is non-empty
    • we have GitLab projects that look like the real thing, but are only present to host migrated Trac issues
    • in such cases we cloned the Gitolite project locally and pushed to the existing repository instead
  5. otherwise, a new repository is created in the legacy/gitolite namespace, using the "import" mechanism in GitLab to automatically import the repository from Gitolite, creating redirections and updating gitolite.conf to document the change
User repositories (those under the user/ directory in Gitolite) were handled specially. First, the existing redirection map was checked to see if a similarly named project was migrated (so that, e.g. user/dgoulet/tor is properly treated as a fork of tpo/core/tor). Then the parent project was forked in GitLab and the Gitolite project force-pushed to the fork. This allows us to show the fork relationship in GitLab and, more importantly, benefit from the "pool" feature in GitLab which deduplicates disk usage between forks. Sometimes, we found no such relationships. Then we simply imported multiple repositories with similar names in the legacy/gitolite namespace, sometimes creating forks between user repositories, on a first-come-first-served basis from the gitolite.conf order. The code used in this migration is now available publicly. We encourage other groups planning to migrate from Gitolite/GitWeb to GitLab to use (and contribute to) our fabric-tasks repository, even though it does have its fair share of hard-coded assertions. The main entry point is the gitolite.mass-repos-migration task. A typical migration job looked like:
anarcat@angela:fabric-tasks$ fab -H cupani.torproject.org gitolite.mass-repos-migration 
[...]
INFO: skipping project project/help/infra in category Migrated to GitLab
INFO: skipping project project/help/wiki in category Migrated to GitLab
INFO: skipping project project/jenkins/jobs in category Migrated to GitLab
INFO: skipping project project/jenkins/tools in category Migrated to GitLab
INFO: searching for projects matching fastlane
INFO: Successfully connected to https://gitlab.torproject.org
import gitolite project project/tor-browser/fastlane into gitlab legacy/gitolite/project/tor-browser/fastlane with desc 'Tor Browser app store and deployment configuration for Fastlane'? [Y/n] 
INFO: importing gitolite project project/tor-browser/fastlane into gitlab legacy/gitolite/project/tor-browser/fastlane with desc 'Tor Browser app store and deployment configuration for Fastlane'
INFO: building a new connect to cupani
INFO: defaulting name to fastlane
INFO: importing project into GitLab
INFO: Successfully connected to https://gitlab.torproject.org
INFO: loading group legacy/gitolite/project/tor-browser
INFO: archiving project
INFO: creating repository fastlane (fastlane) in namespace legacy/gitolite/project/tor-browser from https://git.torproject.org/project/tor-browser/fastlane into https://gitlab.torproject.org/legacy/gitolite/project/tor-browser/fastlane
INFO: migrating Gitolite repository project/tor-browser/fastlane to GitLab project legacy/gitolite/project/tor-browser/fastlane
INFO: uploading 399 bytes to /srv/git.torproject.org/repositories/project/tor-browser/fastlane.git/hooks/pre-receive
INFO: making /srv/git.torproject.org/repositories/project/tor-browser/fastlane.git/hooks/pre-receive executable
INFO: adding entry to rewrite_map /home/anarcat/src/tor/tor-puppet/modules/profile/files/git/gitolite2gitlab.txt
INFO: modifying gitolite.conf to add: "config gitweb.category = Migrated to GitLab"
INFO: rewriting gitolite config /home/anarcat/src/tor/gitolite-admin/conf/gitolite.conf to change project project/tor-browser/fastlane to category Migrated to GitLab
INFO: skipping project project/bridges/bridgedb-admin in category Migrated to GitLab
[...]
In the above, you can see migrated repositories skipped then the fastlane project being archived into GitLab. Another example with a later version of the script, processing only user repositories and showing the interactive prompt and a force-push into a fork:
$ fab -H cupani.torproject.org  gitolite.mass-repos-migration --include 'user/.*' --exclude '.*tor-?browser.*'
INFO: skipping project user/aagbsn/bridgedb in category Migrated to GitLab
[...]
INFO: skipping project user/phw/atlas in category Migrated to GitLab
INFO: processing project user/phw/obfsproxy (Philipp's obfsproxy repository) in category Users' development repositories (Attic)
INFO: Successfully connected to https://gitlab.torproject.org
INFO: user repository detected, trying to find fork phw/obfsproxy
WARNING: no existing fork found, entering user fork subroutine
INFO: found 6 GitLab projects matching 'obfsproxy' (https://gitweb.torproject.org/user/phw/obfsproxy.git)
0 legacy/gitolite/debian/obfsproxy
1 legacy/gitolite/debian/obfsproxy-legacy
2 legacy/gitolite/user/asn/obfsproxy
3 legacy/gitolite/user/ioerror/obfsproxy
4 tpo/anti-censorship/pluggable-transports/obfsproxy
5 tpo/anti-censorship/pluggable-transports/obfsproxy-legacy
select parent to fork from, or enter to abort: ^G4
INFO: repository is not empty: in-pack: 2104, packs: 1, size-pack: 414
fork project tpo/anti-censorship/pluggable-transports/obfsproxy into legacy/gitolite/user/phw/obfsproxy^G [Y/n] 
INFO: loading project tpo/anti-censorship/pluggable-transports/obfsproxy
INFO: forking project user/phw/obfsproxy into namespace legacy/gitolite/user/phw
INFO: waiting for fork to complete...
INFO: fork status: started, sleeping...
INFO: fork finished
INFO: cloning and force pushing from user/phw/obfsproxy to legacy/gitolite/user/phw/obfsproxy
INFO: deleting branch protection: <class 'gitlab.v4.objects.branches.ProjectProtectedBranch'> =>  'id': 2723, 'name': 'master', 'push_access_levels': [ 'id': 2864, 'access_level': 40, 'access_level_description': 'Maintainers', 'deploy_key_id': None ], 'merge_access_levels': [ 'id': 2753, 'access_level': 40, 'access_level_description': 'Maintainers' ], 'allow_force_push': False 
INFO: cloning repository git-rw.torproject.org:/srv/git.torproject.org/repositories/user/phw/obfsproxy.git in /tmp/tmp6orvjggy/user/phw/obfsproxy
Cloning into bare repository '/tmp/tmp6orvjggy/user/phw/obfsproxy'...
INFO: pushing to GitLab: https://gitlab.torproject.org/legacy/gitolite/user/phw/obfsproxy
remote: 
remote: To create a merge request for bug_10887, visit:        
remote:   https://gitlab.torproject.org/legacy/gitolite/user/phw/obfsproxy/-/merge_requests/new?merge_request%5Bsource_branch%5D=bug_10887        
remote: 
[...]
To ssh://gitlab.torproject.org/legacy/gitolite/user/phw/obfsproxy
 + 2bf9d09...a8e54d5 master -> master (forced update)
 * [new branch]      bug_10887 -> bug_10887
[...]
INFO: migrating repo
INFO: migrating Gitolite repository https://gitweb.torproject.org/user/phw/obfsproxy.git to GitLab project https://gitlab.torproject.org/legacy/gitolite/user/phw/obfsproxy
INFO: adding entry to rewrite_map /home/anarcat/src/tor/tor-puppet/modules/profile/files/git/gitolite2gitlab.txt
INFO: modifying gitolite.conf to add: "config gitweb.category = Migrated to GitLab"
INFO: rewriting gitolite config /home/anarcat/src/tor/gitolite-admin/conf/gitolite.conf to change project user/phw/obfsproxy to category Migrated to GitLab
INFO: processing project user/phw/scramblesuit (Philipp's ScrambleSuit repository) in category Users' development repositories (Attic)
INFO: user repository detected, trying to find fork phw/scramblesuit
WARNING: no existing fork found, entering user fork subroutine
WARNING: no matching gitlab project found for user/phw/scramblesuit
INFO: user fork subroutine failed, resuming normal procedure
INFO: searching for projects matching scramblesuit
import gitolite project user/phw/scramblesuit into gitlab legacy/gitolite/user/phw/scramblesuit with desc 'Philipp's ScrambleSuit repository'?^G [Y/n] 
INFO: checking if remote repo https://git.torproject.org/user/phw/scramblesuit exists
INFO: importing gitolite project user/phw/scramblesuit into gitlab legacy/gitolite/user/phw/scramblesuit with desc 'Philipp's ScrambleSuit repository'
INFO: importing project into GitLab
INFO: Successfully connected to https://gitlab.torproject.org
INFO: loading group legacy/gitolite/user/phw
INFO: creating repository scramblesuit (scramblesuit) in namespace legacy/gitolite/user/phw from https://git.torproject.org/user/phw/scramblesuit into https://gitlab.torproject.org/legacy/gitolite/user/phw/scramblesuit
INFO: archiving project
INFO: migrating Gitolite repository https://gitweb.torproject.org/user/phw/scramblesuit.git to GitLab project https://gitlab.torproject.org/legacy/gitolite/user/phw/scramblesuit
INFO: adding entry to rewrite_map /home/anarcat/src/tor/tor-puppet/modules/profile/files/git/gitolite2gitlab.txt
INFO: modifying gitolite.conf to add: "config gitweb.category = Migrated to GitLab"
INFO: rewriting gitolite config /home/anarcat/src/tor/gitolite-admin/conf/gitolite.conf to change project user/phw/scramblesuit to category Migrated to GitLab
[...]
Acute eyes will notice the bell used as a notification mechanism as well in this transcript. A lot of the code is now useless for us, but some, like "commit and push" or is-repo-empty live on in the git module and, of course, the gitlab module has grown some legs along the way. We've also found fun bugs, like a file descriptor exhaustion in bash, among other oddities. The retirement milestone and issue 41215 has a detailed log of the migration, for those curious. This was a challenging project, but it feels nice to have this behind us. This gets rid of 2 of the 4 remaining machines running Debian "old-old-stable", which moves a bit further ahead in our late bullseye upgrades milestone. Full transparency: we tested GPT-3.5, GPT-4, and other large language models to see if they could answer the question "write a set of rewrite rules to redirect GitWeb to GitLab". This has become a standard LLM test for your faithful writer to figure out how good a LLM is at technical responses. None of them gave an accurate, complete, and functional response, for the record. The actual rewrite rules as of this writing follow, for humans that actually like working answers provided by expert humans instead of artificial intelligence which currently seem to be, glorified, mansplaining interns.

git.torproject.org rewrite rules Those rules are relatively simple in that they rewrite a single URL to its equivalent GitLab counterpart in a 1:1 fashion. It relies on the rewrite map mentioned above, of course.
RewriteEngine on
# this RewriteMap connects the gitweb projects to their GitLab
# equivalent
RewriteMap gitolite2gitlab "txt:/etc/apache2/gitolite2gitlab.txt"
# if this becomes a performance bottleneck, convert to a DBM map with:
#
#  $ httxt2dbm -i mapfile.txt -o mapfile.map
#
# and:
#
# RewriteMap mapname "dbm:/etc/apache/mapfile.map"
#
# according to reports lavamind found online, we hit such a
# performance bottleneck only around millions of entries, which is not our case
# those two rules can go away once all the projects are
# migrated to GitLab
#
# this matches the request URI so we can check the RewriteMap
# for a match next
#
# WARNING: this won't match URLs without .git in them, which
# *do* work now. one possibility would be to match the request
# URI (without query string!) with:
#
# /git/(.*)(.git)?/(((branches hooks info objects/).*) git-.* upload-pack receive-pack HEAD config description)?.
#
# I haven't been able to figure out the actual structure of
# those URLs, so it's really hard to figure out the boundaries
# of the project name here. I stopped after pouring around the
# http-backend.c code in git
# itself. https://www.git-scm.com/docs/http-protocol is also
# kind of incomplete and unsatisfying.
RewriteCond % REQUEST_URI  ^/(git/)?(.*).git/.*$
# this makes the RewriteRule match only if there's a match in
# the rewrite map
RewriteCond $ gitolite2gitlab:%2 NOT_FOUND  !NOT_FOUND
RewriteRule ^/(git/)?(.*).git/(.*)$ https://gitlab.torproject.org/$ gitolite2gitlab:$2 .git/$3 [R=302,L]
# Fallback everything else to GitLab
RewriteRule (.*) https://gitlab.torproject.org [R=302,L]

gitweb.torproject.org rewrite rules Those are the vastly more complicated GitWeb to GitLab rewrite rules. Note that we say "GitWeb" but we were actually not running GitWeb but cgit, as the former didn't actually scale for us.
RewriteEngine on
# this RewriteMap connects the gitweb projects to their GitLab
# equivalent
RewriteMap gitolite2gitlab "txt:/etc/apache2/gitolite2gitlab.txt"
# special rule to process targets of the old spec.tpo site and
# bring them to the right redirect on the new spec.tpo site. that should turn, for example:
#
# https://gitweb.torproject.org/torspec.git/tree/address-spec.txt
#
# into:
#
# https://spec.torproject.org/address-spec
RewriteRule ^/torspec.git/tree/(.*).txt$ https://spec.torproject.org/$1 [R=302]
# list of endpoints taken from cgit's cmd.c
# those two RewriteCond are necessary because we don't move
# all repositories at once. once the migration is completed,
# they can be removed.
#
# and yes, they are copied all over the place below
#
# create a match for the project name to check if the project
# has been moved to GitLab
RewriteCond % REQUEST_URI  ^/(.*).git(/.*)?$
# this makes the RewriteRule match only if there's a match in
# the rewrite map
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
# main project page, like summary below
RewriteRule ^/(.*).git/?$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 / [R=302,L]
# summary
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteRule ^/(.*).git/summary/?$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 / [R=302,L]
# about
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteRule ^/(.*).git/about/?$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 / [R=302,L]
# commit
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteCond "% QUERY_STRING " "(.*(?:^ &))id=([^&]*)(&.*)?$"
RewriteRule ^/(.*).git/commit/? https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/commit/%2 [R=302,L,QSD]
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteRule ^/(.*).git/commit/? https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/commits/HEAD [R=302,L]
# diff, incomplete because can diff arbitrary refs and files in cgit but not in GitLab, hard to parse
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteCond % QUERY_STRING  id=([^&]*)
RewriteRule ^/(.*).git/diff/? https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/commit/%1 [R=302,L,QSD]
# patch
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteCond % QUERY_STRING  id=([^&]*)
RewriteRule ^/(.*).git/patch/? https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/commit/%1.patch [R=302,L,QSD]
# rawdiff, incomplete because can show only one file diff, which GitLab cannot
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteCond % QUERY_STRING  id=([^&]*)
RewriteRule ^/(.*).git/rawdiff/?$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/commit/%1.diff [R=302,L,QSD]
# log
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteCond % QUERY_STRING  h=([^&]*)
RewriteRule ^/(.*).git/log/?$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/commits/%1 [R=302,L,QSD]
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteRule ^/(.*).git/log/?$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/commits/HEAD [R=302,L]
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteRule ^/(.*).git/log(/?.*)$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/commits/HEAD$2 [R=302,L]
# atom
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteCond % QUERY_STRING  h=([^&]*)
RewriteRule ^/(.*).git/atom/?$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/commits/%1 [R=302,L,QSD]
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteRule ^/(.*).git/atom/?$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/commits/HEAD [R=302,L,QSD]
# refs, incomplete because two pages in GitLab, defaulting to "tags"
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteRule ^/(.*).git/refs/?$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/tags [R=302,L]
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteCond % QUERY_STRING  h=([^&]*)
RewriteRule ^/(.*).git/tag/? https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/tags/%1 [R=302,L,QSD]
# tree
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteCond % QUERY_STRING  id=([^&]*)
RewriteRule ^/(.*).git/tree(/?.*)$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/tree/%1$2 [R=302,L,QSD]
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteRule ^/(.*).git/tree(/?.*)$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/tree/HEAD$2 [R=302,L]
# /-/tree has no good default in GitLab, revert to HEAD which is a good
# approximation (we can't assume "master" here anymore)
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteRule ^/(.*).git/tree/?$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/tree/HEAD [R=302,L]
# plain
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteCond % QUERY_STRING  h=([^&]*)
RewriteRule ^/(.*).git/plain(/?.*)$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/raw/%1$2 [R=302,L,QSD]
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteRule ^/(.*).git/plain(/?.*)$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/raw/HEAD$2 [R=302,L]
# blame: disabled
#RewriteCond % REQUEST_URI  ^/(.*).git/.*$
#RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
#RewriteCond % QUERY_STRING  h=([^&]*)
#RewriteRule ^/(.*).git/blame(/?.*)$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/blame/%1$2 [R=302,L,QSD]
# same default as tree above
#RewriteCond % REQUEST_URI  ^/(.*).git/.*$
#RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
#RewriteRule ^/(.*).git/blame(/?.*)$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/blame/HEAD/$2 [R=302,L]
# stats
RewriteCond % REQUEST_URI  ^/(.*).git/.*$
RewriteCond $ gitolite2gitlab:%1 NOT_FOUND  !NOT_FOUND
RewriteRule ^/(.*).git/stats/?$ https://gitlab.torproject.org/$ gitolite2gitlab:$1 /-/graphs/HEAD [R=302,L]
# still TODO:
# repolist: once migration is complete
#
# cannot be done:
# atom: needs a feed token, user must be logged in
# blob: no direct equivalent
# info: not working on main cgit website?
# ls_cache: not working, irrelevant?
# objects: undocumented?
# snapshot: pattern too hard to match on cgit's side
# special case, we keep a copy of the main index on the archive
RewriteRule ^/?$ https://archive.torproject.org/websites/gitweb.torproject.org.html [R=302,L]
# Fallback: everything else to GitLab
RewriteRule .* https://gitlab.torproject.org [R=302,L]
The reference copy of those is available in our (currently private) Puppet git repository.

Matthew Palmer: The Mediocre Programmer's Guide to Rust

Me: Hi everyone, my name s Matt, and I m a mediocre programmer. Everyone: Hi, Matt. Facilitator: Are you an alcoholic, Matt? Me: No, not since I stopped reading Twitter. Facilitator: Then I think you re in the wrong room.
Yep, that s my little secret I m a mediocre programmer. The definition of the word hacker I most closely align with is someone who makes furniture with an axe . I write simple, straightforward code because trying to understand complexity makes my head hurt. Which is why I ve always avoided the more academic languages, like OCaml, Haskell, Clojure, and so on. I know they re good languages people far smarter than me are building amazing things with them but the time I hear the word endofunctor , I ve lost all focus (and most of my will to live). My preferred languages are the ones that come with less intellectual overhead, like C, PHP, Python, and Ruby. So it s interesting that I ve embraced Rust with significant vigour. It s by far the most complicated language that I feel at least vaguely comfortable with using in anger . Part of that is that I ve managed to assemble a set of principles that allow me to almost completely avoid arguing with Rust s dreaded borrow checker, lifetimes, and all the rest of the dark, scary corners of the language. It s also, I think, that Rust helps me to write better software, and I can feel it helping me (almost) all of the time. In the spirit of helping my fellow mediocre programmers to embrace Rust, I present the principles I ve assembled so far.

Neither a Borrower Nor a Lender Be If you know anything about Rust, you probably know about the dreaded borrow checker . It s the thing that makes sure you don t have two pieces of code trying to modify the same data at the same time, or using a value when it s no longer valid. While Rust s borrowing semantics allow excellent performance without compromising safety, for us mediocre programmers it gets very complicated, very quickly. So, the moment the compiler wants to start talking about explicit lifetimes , I shut it up by just using owned values instead. It s not that I never borrow anything; I have some situations that I know are borrow-safe for the mediocre programmer (I ll cover those later). But any time I m not sure how things will pan out, I ll go straight for an owned value. For example, if I need to store some text in a struct or enum, it s going straight into a String. I m not going to start thinking about lifetimes and &'a str; I ll leave that for smarter people. Similarly, if I need a list of things, it s a Vec<T> every time no &'b [T] in my structs, thank you very much.

Attack of the Clones Following on from the above, I ve come to not be afraid of .clone(). I scatter them around my code like seeds in a field. Life s too short to spend time trying to figure out who s borrowing what from whom, if I can just give everyone their own thing. There are warnings in the Rust book (and everywhere else) about how a clone can be expensive . While it s true that, yes, making clones of data structures consumes CPU cycles and memory, it very rarely matters. CPU cycles are (usually) plentiful and RAM (usually) relatively cheap. Mediocre programmer mental effort is expensive, and not to be spent on premature optimisation. Also, if you re coming from most any other modern language, Rust is already giving you so much more performance that you re probably ending up ahead of the game, even if you .clone() everything in sight. If, by some miracle, something I write gets so popular that the expense of all those spurious clones becomes a problem, it might make sense to pay someone much smarter than I to figure out how to make the program a zero-copy masterpiece of efficient code. Until then clone early and clone often, I say!

Derive Macros are Powerful Magicks If you start .clone()ing everywhere, pretty quickly you ll be hit with this error:

error[E0599]: no method named  clone  found for struct  Foo  in the current scope

This is because not everything can be cloned, and so if you want your thing to be cloned, you need to implement the method yourself. Well sort of. One of the things that I find absolutely outstanding about Rust is the derive macro . These allow you to put a little marker on a struct or enum, and the compiler will write a bunch of code for you! Clone is one of the available so-called derivable traits , so you add #[derive(Clone)] to your structs, and poof! you can .clone() to your heart s content. But there are other things that are commonly useful, and so I ve got a set of traits that basically all of my data structures derive:

#[derive(Clone, Debug, Default)]
struct Foo  
    // ...
 

Every time I write a struct or enum definition, that line #[derive(Clone, Debug, Default)] goes at the top. The Debug trait allows you to print a debug representation of the data structure, either with the dbg!() macro, or via the :? format in the format!() macro (and anywhere else that takes a format string). Being able to say what exactly is that? comes in handy so often, not having a Debug implementation is like programming with one arm tied behind your Aeron. Meanwhile, the Default trait lets you create an empty instance of your data structure, with all of the fields set to their own default values. This only works if all the fields themselves implement Default, but a lot of standard types do, so it s rare that you ll define a structure that can t have an auto-derived Default. Enums are easily handled too, you just mark one variant as the default:

#[derive(Clone, Debug, Default)]
enum Bar  
    Something(String),
    SomethingElse(i32),
    #[default]   // <== mischief managed
    Nothing,
 

Borrowing is OK, Sometimes While I previously said that I like and usually use owned values, there are a few situations where I know I can borrow without angering the borrow checker gods, and so I m comfortable doing it. The first is when I need to pass a value into a function that only needs to take a little look at the value to decide what to do. For example, if I want to know whether any values in a Vec<u32> are even, I could pass in a Vec, like this:

fn main()  
    let numbers = vec![0u32, 1, 2, 3, 4, 5];
    if has_evens(numbers)  
        println!("EVENS!");
     
 
fn has_evens(numbers: Vec<u32>) -> bool  
    numbers.iter().any( n  n % 2 == 0)
 

Howver, this gets ugly if I m going to use numbers later, like this:

fn main()  
    let numbers = vec![0u32, 1, 2, 3, 4, 5];
    if has_evens(numbers)  
        println!("EVENS!");
     
    // Compiler complains about "value borrowed here after move"
    println!("Sum:  ", numbers.iter().sum::<u32>());
 
fn has_evens(numbers: Vec<u32>) -> bool  
    numbers.iter().any( n  n % 2 == 0)
 

Helpfully, the compiler will suggest I use my old standby, .clone(), to fix this problem. But I know that the borrow checker won t have a problem with lending that Vec<u32> into has_evens() as a borrowed slice, &[u32], like this:

fn main()  
    let numbers = vec![0u32, 1, 2, 3, 4, 5];
    if has_evens(&numbers)  
        println!("EVENS!");
     
 
fn has_evens(numbers: &[u32]) -> bool  
    numbers.iter().any( n  n % 2 == 0)
 

The general rule I ve got is that if I can take advantage of lifetime elision (a fancy term meaning the compiler can figure it out ), I m probably OK. In less fancy terms, as long as the compiler doesn t tell me to put 'a anywhere, I m in the green. On the other hand, the moment the compiler starts using the words explicit lifetime , I nope the heck out of there and start cloning everything in sight. Another example of using lifetime elision is when I m returning the value of a field from a struct or enum. In that case, I can usually get away with returning a borrowed value, knowing that the caller will probably just be taking a peek at that value, and throwing it away before the struct itself goes out of scope. For example:

struct Foo  
    id: u32,
    desc: String,
 
impl Foo  
    fn description(&self) -> &str  
        &self.desc
     
 

Returning a reference from a function is practically always a mortal sin for mediocre programmers, but returning one from a struct method is often OK. In the rare case that the caller does want the reference I return to live for longer, they can always turn it into an owned value themselves, by calling .to_owned().

Avoid the String Tangle Rust has a couple of different types for representing strings String and &str being the ones you see most often. There are good reasons for this, however it complicates method signatures when you just want to take some sort of bunch of text , and don t care so much about the messy details. For example, let s say we have a function that wants to see if the length of the string is even. Using the logic that since we re just taking a peek at the value passed in, our function might take a string reference, &str, like this:

fn is_even_length(s: &str) -> bool  
    s.len() % 2 == 0
 

That seems to work fine, until someone wants to check a formatted string:

fn main()  
    // The compiler complains about "expected  &str , found  String "
    if is_even_length(format!("my string is  ", std::env::args().next().unwrap()))  
        println!("Even length string");
     
 

Since format! returns an owned string, String, rather than a string reference, &str, we ve got a problem. Of course, it s straightforward to turn the String from format!() into a &str (just prefix it with an &). But as mediocre programmers, we can t be expected to remember which sort of string all our functions take and add & wherever it s needed, and having to fix everything when the compiler complains is tedious. The converse can also happen: a method that wants an owned String, and we ve got a &str (say, because we re passing in a string literal, like "Hello, world!"). In this case, we need to use one of the plethora of available turn this into a String mechanisms (.to_string(), .to_owned(), String::from(), and probably a few others I ve forgotten), on the value before we pass it in, which gets ugly real fast. For these reasons, I never take a String or an &str as an argument. Instead, I use the Power of Traits to let callers pass in anything that is, or can be turned into, a string. Let us have some examples. First off, if I would normally use &str as the type, I instead use impl AsRef<str>:

fn is_even_length(s: impl AsRef<str>) -> bool  
    s.as_ref().len() % 2 == 0
 

Note that I had to throw in an extra as_ref() call in there, but now I can call this with either a String or a &str and get an answer. Now, if I want to be given a String (presumably because I plan on taking ownership of the value, say because I m creating a new instance of a struct with it), I use impl Into<String> as my type:

struct Foo  
    id: u32,
    desc: String,
 
impl Foo  
    fn new(id: u32, desc: impl Into<String>) -> Self  
        Self   id, desc: desc.into()  
     
 

We have to call .into() on our desc argument, which makes the struct building a bit uglier, but I d argue that s a small price to pay for being able to call both Foo::new(1, "this is a thing") and Foo::new(2, format!("This is a thing named name ")) without caring what sort of string is involved.

Always Have an Error Enum Rust s error handing mechanism (Results everywhere), along with the quality-of-life sugar surrounding it (like the short-circuit operator, ?), is a delightfully ergonomic approach to error handling. To make life easy for mediocre programmers, I recommend starting every project with an Error enum, that derives thiserror::Error, and using that in every method and function that returns a Result. How you structure your Error type from there is less cut-and-dried, but typically I ll create a separate enum variant for each type of error I want to have a different description. With thiserror, it s easy to then attach those descriptions:

#[derive(Clone, Debug, thiserror::Error)]
enum Error  
    #[error(" 0  caught fire")]
    Combustion(String),
    #[error(" 0  exploded")]
    Explosion(String),
 

I also implement functions to create each error variant, because that allows me to do the Into<String> trick, and can sometimes come in handy when creating errors from other places with .map_err() (more on that later). For example, the impl for the above Error would probably be:

impl Error  
    fn combustion(desc: impl Into<String>) -> Self  
        Self::Combustion(desc.into())
     
    fn explosion(desc: impl Into<String>) -> Self  
        Self::Explosion(desc.into())
     
 

It s a tedious bit of boilerplate, and you can use the thiserror-ext crate s thiserror_ext::Construct derive macro to do the hard work for you, if you like. It, too, knows all about the Into<String> trick.

Banish map_err (well, mostly) The newer mediocre programmer, who is just dipping their toe in the water of Rust, might write file handling code that looks like this:

fn read_u32_from_file(name: impl AsRef<str>) -> Result<u32, Error>  
    let mut f = File::open(name.as_ref())
        .map_err( e  Error::FileOpenError(name.as_ref().to_string(), e))?;
    let mut buf = vec![0u8; 30];
    f.read(&mut buf)
        .map_err( e  Error::ReadError(e))?;
    String::from_utf8(buf)
        .map_err( e  Error::EncodingError(e))?
        .parse::<u32>()
        .map_err( e  Error::ParseError(e))
 

This works great (or it probably does, I haven t actually tested it), but there are a lot of .map_err() calls in there. They take up over half the function, in fact. With the power of the From trait and the magic of the ? operator, we can make this a lot tidier. First off, assume we ve written boilerplate error creation functions (or used thiserror_ext::Construct to do it for us)). That allows us to simplify the file handling portion of the function a bit:

fn read_u32_from_file(name: impl AsRef<str>) -> Result<u32, Error>  
    let mut f = File::open(name.as_ref())
        // We've dropped the  .to_string()  out of here...
        .map_err( e  Error::file_open_error(name.as_ref(), e))?;
    let mut buf = vec![0u8; 30];
    f.read(&mut buf)
        // ... and the explicit parameter passing out of here
        .map_err(Error::read_error)?;
    // ...

If that latter .map_err() call looks weird, without the e and such, it s passing a function-as-closure, which just saves on a few characters typing. Just because we re mediocre, doesn t mean we re not also lazy. Next, if we implement the From trait for the other two errors, we can make the string-handling lines significantly cleaner. First, the trait impl:

impl From<std::string::FromUtf8Error> for Error  
    fn from(e: std::string::FromUtf8Error) -> Self  
        Self::EncodingError(e)
     
 
impl From<std::num::ParseIntError> for Error  
    fn from(e: std::num::ParseIntError) -> Self  
        Self::ParseError(e)
     
 

(Again, this is boilerplate that can be autogenerated, this time by adding a #[from] tag to the variants you want a From impl on, and thiserror will take care of it for you) In any event, no matter how you get the From impls, once you have them, the string-handling code becomes practically error-handling-free:

    Ok(
        String::from_utf8(buf)?
            .parse::<u32>()?
    )

The ? operator will automatically convert the error from the types returned from each method into the return error type, using From. The only tiny downside to this is that the ? at the end strips the Result, and so we ve got to wrap the returned value in Ok() to turn it back into a Result for returning. But I think that s a small price to pay for the removal of those .map_err() calls. In many cases, my coding process involves just putting a ? after every call that returns a Result, and adding a new Error variant whenever the compiler complains about not being able to convert some new error type. It s practically zero effort outstanding outcome for the mediocre programmer.

Just Because You re Mediocre, Doesn t Mean You Can t Get Better To finish off, I d like to point out that mediocrity doesn t imply shoddy work, nor does it mean that you shouldn t keep learning and improving your craft. One book that I ve recently found extremely helpful is Effective Rust, by David Drysdale. The author has very kindly put it up to read online, but buying a (paper or ebook) copy would no doubt be appreciated. The thing about this book, for me, is that it is very readable, even by us mediocre programmers. The sections are written in a way that really clicked with me. Some aspects of Rust that I d had trouble understanding for a long time such as lifetimes and the borrow checker, and particularly lifetime elision actually made sense after I d read the appropriate sections.

Finally, a Quick Beg I m currently subsisting on the kindness of strangers, so if you found something useful (or entertaining) in this post, why not buy me a refreshing beverage? It helps to know that people like what I m doing, and helps keep me from having to sell my soul to a private equity firm.

26 April 2024

Russell Coker: Humane AI Pin

I wrote a blog post The Shape of Computers [1] exploring ideas of how computers might evolve and how we can use them. One of the devices I mentioned was the Humane AI Pin, which has just been the recipient of one of the biggest roast reviews I ve ever seen [2], good work Marques Brownlee! As an aside I was once given a product to review which didn t work nearly as well as I think it should have worked so I sent an email to the developers saying sorry this product failed to work well so I can t say anything good about it and didn t publish a review. One of the first things that caught my attention in the review is the note that the AI Pin doesn t connect to your phone. I think that everything should connect to everything else as a usability feature. For security we don t want so much connecting and it s quite reasonable to turn off various connections at appropriate times for security, the Librem5 is an example of how this can be done with hardware switches to disable Wifi etc. But to just not have connectivity is bad. The next noteworthy thing is the external battery which also acts as a magnetic attachment from inside your shirt. So I guess it s using wireless charging through your shirt. A magnetically attached external battery would be a great feature for a phone, you could quickly swap a discharged battery for a fresh one and keep using it. When I tried to make the PinePhonePro my daily driver [3] I gave up and charging was one of the main reasons. One thing I learned from my experiment with the PinePhonePro is that the ratio of charge time to discharge time is sometimes more important than battery life and being able to quickly swap batteries without rebooting is a way of solving that. The reviewer of the AI Pin complains later in the video about battery life which seems to be partly due to wireless charging from the detachable battery and partly due to being physically small. It seems the phablet form factor is the smallest viable personal computer at this time. The review glosses over what could be the regarded as the 2 worst issues of the device. It does everything via the cloud (where the cloud means a computer owned by someone I probably shouldn t trust ) and it records everything. Strange that it s not getting the hate the Google Glass got. The user interface based on laser projection of menus on the palm of your hand is an interesting concept. I d rather have a Bluetooth attached tablet or something for operations that can t be conveniently done with voice. The reviewer harshly criticises the laser projection interface later in the video, maybe technology isn t yet adequate to implement this properly. The first criticism of the device in the review part of the video is of the time taken to answer questions, especially when Internet connectivity is poor. His question who designed the Washington Monument took 8 seconds to start answering it in his demonstration. I asked the Alpaca LLM the same question running on 4 cores of a E5-2696 and it took 10 seconds to start answering and then printed the words at about speaking speed. So if we had a free software based AI device for this purpose it shouldn t be difficult to get local LLM computation with less delay than the Humane device by simply providing more compute power than 4 cores of a E5-2696v3. How does a 32 core 1.05GHz Mali G72 from 2017 (as used in the Galaxy Note 9) compare to 4 cores of a 2.3GHz Intel CPU from 2015? Passmark says that Intel CPU can do 48GFlop with all 18 cores so 4 cores can presumably do about 10GFlop which seems less than the claimed 20-32GFlop of the Mali G72. It seems that with the right software even older Android phones could give adequate performance for a local LLM. The Alpaca model I m testing with takes 4.2G of RAM to run which is usable in a Note 9 with 8G of RAM or a Pixel 8 Pro with 12G. A Pixel 8 Pro could have 4.2G of RAM reserved for a LLM and still have as much RAM for other purposes as my main laptop as of a few months ago. I consider the speed of Alpaca on my workstation to be acceptable but not great. If we can get FOSS phones running a LLM at that speed then I think it would be great for a first version we can always rely on newer and faster hardware becoming available. Marques notes that the cause of some of the problems is likely due to a desire to make it a separate powerful product in the future and that if they gave it phone connectivity in the start they would have to remove that later on. I think that the real problem is that the profit motive is incompatible with good design. They want to have a product that s stand-alone and justifies the purchase price plus subscription and that means not making it a phone accessory . While I think that the best thing for the user is to allow it to talk to a phone, a PC, a car, and anything else the user wants. He compares it to the Apple Vision Pro which has the same issue of trying to be a stand-alone computer but not being properly capable of it. One of the benefits that Marques cites for the AI Pin is the ability to capture voice notes. Dictaphones have been around for over 100 years and very few people have bought them, not even in the 80s when they became cheap. While almost everyone can occasionally benefit from being able to make a note of an idea when it s not convenient to write it down there are few people who need it enough to carry a separate device, not even if that device is tiny. But a phone as a general purpose computing device with microphone can easily be adapted to such things. One possibility would be to program a phone to start a voice note when the volume up and down buttons are pressed at the same time or when some other condition is met. Another possibility is to have a phone have a hotkey function that varies by what you are doing, EG if bushwalking have the hotkey be to take a photo or if on a flight have it be taking a voice note. On the Mobile Apps page on the Debian wiki I created a section for categories of apps that I think we need [4]. In that section I added the following list:
  1. Voice input for dictation
  2. Voice assistant like Google/Apple
  3. Voice output
  4. Full operation for visually impaired people
One thing I really like about the AI Pin is that it has the potential to become a really good computing and personal assistant device for visually impaired people funded by people with full vision who want to legally control a computer while driving etc. I have some concerns about the potential uses of the AI Pin while driving (as Marques stated an aim to do), but if it replaces the use of regular phones while driving it will make things less bad. Marques concludes his video by warning against buying a product based on the promise of what it can be in future. I bought the Librem5 on exactly that promise, the difference is that I have the source and the ability to help make the promise come true. My aim is to spend thousands of dollars on test hardware and thousands of hours of development time to help make FOSS phones a product that most people can use at low price with little effort. Another interesting review of the pin is by Mrwhostheboss [5], one of his examples is of asking the pin for advice about a chair but without him knowing the pin selected a different chair in the room. He compares this to using Google s apps on a phone and seeing which item the app has selected. He also said that he doesn t want to make an order based on speech he wants to review a page of information about it. I suspect that the design of the pin had too much input from people accustomed to asking a corporate travel office to find them a flight and not enough from people who look through the details of the results of flight booking services trying to save an extra $20. Some people might say if you need to save $20 on a flight then a $24/month subscription computing service isn t for you , I reject that argument. I can afford lots of computing services because I try to get the best deal on every moderately expensive thing I pay for. Another point that Mrwhostheboss makes is regarding secret SMS, you probably wouldn t want to speak a SMS you are sending to your SO while waiting for a train. He makes it clear that changing between phone and pin while sharing resources (IE not having a separate phone number and separate data store) is a desired feature. The most insightful point Mrwhostheboss made was when he suggested that if the pin had come out before the smartphone then things might have all gone differently, but now anything that s developed has to be based around the expectations of phone use. This is something we need to keep in mind when developing FOSS software, there s lots of different ways that things could be done but we need to meet the expectations of users if we want our software to be used by many people. I previously wrote a blog post titled Considering Convergence [6] about the possible ways of using a phone as a laptop. While I still believe what I wrote there I m now considering the possibility of ease of movement of work in progress as a way of addressing some of the same issues. I ve written a blog post about Convergence vs Transferrence [7].

Russell Coker: Convergence vs Transference

I previously wrote a blog post titled Considering Convergence [1] about the possible ways of using a phone as a laptop. While I still believe what I wrote there I m now considering the possibility of ease of movement of work in progress as a way of addressing some of the same issues. Currently the expected use is that if you have web pages open on Chrome on Android it s possible to instruct Chrome on the desktop to open the same page if both instances of Chrome are signed in to the same GMail account. It s also possible to view the Chrome history with CTRL-H, select tabs from other devices and load things that were loaded on other devices some time ago. This is very minimal support for moving work between devices and I think we can do better. Firstly for web browsing the Chrome functionality is barely adequate. It requires having a heavyweight login process on all browsers that includes sharing stored passwords etc which isn t desirable. There are many cases where moving work is desired without sharing such things, one example is using a personal device to research something for work. Also the Chrome method of sending web pages is slow and unreliable and the viewing history method gets all closed tabs when the common case is get the currently open tabs from one browser window without wanting the dozens of web pages that turned out not to be interesting and were closed. This could be done with browser plugins to allow functionality similar to KDE Connect for sending tabs and also the option of emailing a list of URLs or a JSON file that could be processed by a browser plugin on the receiving end. I can send email between my home and work addresses faster than the Chrome share to another device function can send a URL. For documents we need a way of transferring files. One possibility is to go the Chromebook route and have it all stored on the web. This means that you rely on a web based document editing system and the FOSS versions are difficult to manage. Using Google Docs or Sharepoint for everything is not something I consider an acceptable option. Also for laptop use being able to run without Internet access is a good thing. There are a range of distributed filesystems that have been used for various purposes. I don t think any of them cater to the use case of having a phone/laptop and a desktop PC (or maybe multiple PCs) using the same files. For a technical user it would be an option to have a script that connects to a peer system (IE another computer with the same accounts and access control decisions) and rsync a directory of working files and the shell history, and then opens a shell with the HISTFILE variable, current directory, and optionally some user environment variables set to match. But this wouldn t be the most convenient thing even for technical users. For programs that are integrated into the desktop environment it s possible for them to be restarted on login if they were active when the user logged out. The session tracking for that has about 1/4 the functionality needed for requesting a list of open files from the application, closing the application, transferring the files, and opening it somewhere else. I think that this would be a good feature to add to the XDG setup. The model of having programs and data attached to one computer or one network server that terminals of some sort connect to worked well when computers were big and expensive. But computers continue to get smaller and cheaper so we need to think of a document based use of computers to allow things to be easily transferred as convenient. With convenience being important so the hacks of rsync scripts that can work for technical users won t work for most people.

24 April 2024

Russell Coker: Source Code With Emoji

The XKCD comic Code Quality [1] inspired me to test out emoji in source. I really should have done this years ago when that XKCD was first published. The following code compiles in gcc and runs in the way that anyone who wants to write such code would want it to run. The hover text in the XKCD comic is correct. You could have a style guide for such programming, store error messages in the doctor and nurse emoji for example.
#include <stdio.h>
int main()
 
  int   = 1,   = 2;
  printf(" =%d,  =%d\n",  ,  );
  return 0;
 
To get this to display correctly in Debian you need to install the fonts-noto-color-emoji package (used by the KDE emoji picker that runs when you press Windows-. among other things) and restart programs that use emoji. The Konsole terminal emulator will probably need it s profile settings changed to work with this if you ran Konsole before installing fonts-noto-color-emoji. The Kitty terminal emulator works if you restart it after installing fonts-noto-color-emoji. This web page gives a list of HTML codes for emoji [2]. If I start writing real code with emoji variable names then I ll have to update my source to HTML conversion script (which handles <>" and repeated spaces) to convert emoji. I spent a couple of hours on this and I think it s worth it. I have filed several Debian bug reports about improvements needed to issues related to emoji.

22 April 2024

Russ Allbery: Review: The Stars, Like Dust

Review: The Stars, Like Dust, by Isaac Asimov
Series: Galactic Empire #2
Publisher: Fawcett Crest
Copyright: 1950, 1951
Printing: June 1972
Format: Mass market
Pages: 192
The Stars, Like Dust is usually listed as the first book in Asimov's lesser-known Galactic Empire Trilogy since it takes place before Pebble in the Sky. Pebble in the Sky was published first, though, so I count it as the second book. It is very early science fiction with a few mystery overtones. Buying books produces about 5% of the pleasure of reading them while taking much less than 5% of the time. There was a time in my life when I thoroughly enjoyed methodically working through a used book store, list in hand, tracking down cheap copies to fill in holes in series. This means that I own a lot of books that I thought at some point that I would want to read but never got around to, often because, at the time, I was feeling completionist about some series or piece of world-building. From time to time, I get the urge to try to read some of them. Sometimes this is a poor use of my time. The Galactic Empire series is from Asimov's first science fiction period, after the Foundation series but contemporaneous with their collection into novels. They're set long, long before Foundation, but after humans have inhabited numerous star systems and Earth has become something of a backwater. That process is just starting in The Stars, Like Dust: Earth is still somewhere where an upper-class son might be sent for an education, but it has been devastated by nuclear wars and is well on its way to becoming an inward-looking relic on the edge of galactic society. Biron Farrill is the son of the Lord Rancher of Widemos, a wealthy noble whose world is one of those conquered by the Tyranni. In many other SF novels, the Tyranni would be an alien race; here, it's a hierarchical and authoritarian human civilization. The book opens with Biron discovering a radiation bomb planted in his dorm room. Shortly after, he learns that his father had been arrested. One of his fellow students claims to be on Biron's side against the Tyranni and gives him false papers to travel to Rhodia, a wealthy world run by a Tyranni sycophant. Like most books of this era, The Stars, Like Dust is a short novel full of plot twists. Unlike some of its contemporaries, it's not devoid of characterization, but I might have liked it better if it were. Biron behaves like an obnoxious teenager when he's not being an arrogant ass. There is a female character who does a few plot-relevant things and at no point is sexually assaulted, so I'll give Asimov that much, but the gender stereotypes are ironclad and there is an entire subplot focused on what I can only describe as seduction via petty jealousy. The writing... well, let me quote a typical passage:
There was no way of telling when the threshold would be reached. Perhaps not for hours, and perhaps the next moment. Biron remained standing helplessly, flashlight held loosely in his damp hands. Half an hour before, the visiphone had awakened him, and he had been at peace then. Now he knew he was going to die. Biron didn't want to die, but he was penned in hopelessly, and there was no place to hide.
Needless to say, Biron doesn't die. Even if your tolerance for pulp melodrama is high, 192 small-print pages of this sort of thing is wearying. Like a lot of Asimov plots, The Stars, Like Dust has some of the shape of a mystery novel. Biron, with the aid of some newfound companions on Rhodia, learns of a secret rebellion against the Tyranni and attempts to track down its base to join them. There are false leads, disguised identities, clues that are difficult to interpret, and similar classic mystery trappings, all covered with a patina of early 1950s imaginary science. To me, it felt constructed and artificial in ways that made the strings Asimov was pulling obvious. I don't know if someone who likes mystery construction would feel differently about it. The worst part of the plot thankfully doesn't come up much. We learn early in the story that Biron was on Earth to search for a long-lost document believed to be vital to defeating the Tyranni. The nature of that document is revealed on the final page, so I won't spoil it, but if you try to think of the stupidest possible document someone could have built this plot around, I suspect you will only need one guess. (In Asimov's defense, he blamed Galaxy editor H.L. Gold for persuading him to include this plot, and disavowed it a few years later.) The Stars, Like Dust is one of the worst books I have ever read. The characters are overwrought, the politics are slapdash and build on broad stereotypes, the romantic subplot is dire and plays out mainly via Biron egregiously manipulating his petulant love interest, and the writing is annoying. Sometimes pulp fiction makes up for those common flaws through larger-than-life feats of daring, sweeping visions of future societies, and ever-escalating stakes. There is little to none of that here. Asimov instead provides tedious political maneuvering among a class of elitist bankers and land owners who consider themselves natural leaders. The only places where the power structures of this future government make sense are where Asimov blatantly steals them from either the Roman Empire or the Doge of Venice. The one thing this book has going for it the thing, apart from bloody-minded completionism, that kept me reading is that the technology is hilariously weird in that way that only 1940s and 1950s science fiction can be. The characters have access to communication via some sort of interstellar telepathy (messages coded to a specific person's "brain waves") and can travel between stars through hyperspace jumps, but each jump is manually calculated by referring to the pilot's (paper!) volumes of the Standard Galactic Ephemeris. Communication between ships (via "etheric radio") requires manually aiming a radio beam at the area in space where one thinks the other ship is. It's an unintentionally entertaining combination of technology that now looks absurdly primitive and science that is so advanced and hand-waved that it's obviously made up. I also have to give Asimov some points for using spherical coordinates. It's a small thing, but the coordinate systems in most SF novels and TV shows are obviously not fit for purpose. I spent about a month and a half of this year barely reading, and while some of that is because I finally tackled a few projects I'd been putting off for years, a lot of it was because of this book. It was only 192 pages, and I'm still curious about the glue between Asimov's Foundation and Robot series, both of which I devoured as a teenager. But every time I picked it up to finally finish it and start another book, I made it about ten pages and then couldn't take any more. Learn from my error: don't try this at home, or at least give up if the same thing starts happening to you. Followed by The Currents of Space. Rating: 2 out of 10

19 April 2024

Louis-Philippe V ronneau: Montreal's Debian & Stuff - March 2024

Time really flies when you are really busy you have fun! Our Montr al Debian User Group met on Sunday March 31st and I only just found the time to write our report :) This time around, 9 of us we met at EfficiOS's offices1 to chat, hang out and work on Debian and other stuff! Here is what we did: pollo: tvaz: tassia: viashimo: lavamind: justin: Pictures Here are pictures of the event. Well, one picture (thanks Tassia!) of the event itself and another one of the crisp Italian lager I drank at the bar after the event :) People at the event working around a long table A glass of beer illuminated by sunlight

  1. Maintainers, amongst other things, of the great LTTng.

18 April 2024

Thomas Koch: Minimal overhead VMs with Nix and MicroVM

Posted on March 17, 2024
Joachim Breitner wrote about a Convenient sandboxed development environment and thus reminded me to blog about MicroVM. I ve toyed around with it a little but not yet seriously used it as I m currently not coding. MicroVM is a nix based project to configure and run minimal VMs. It can mount and thus reuse the hosts nix store inside the VM and thus has a very small disk footprint. I use MicroVM on a debian system using the nix package manager. The MicroVM author uses the project to host production services. Otherwise I consider it also a nice way to learn about NixOS after having started with the nix package manager and before making the big step to NixOS as my main system. The guests root filesystem is a tmpdir, so one must explicitly define folders that should be mounted from the host and thus be persistent across VM reboots. I defined the VM as a nix flake since this is how I started from the MicroVM projects example:
 
  description = "Haskell dev MicroVM";
  inputs.impermanence.url = "github:nix-community/impermanence";
  inputs.microvm.url = "github:astro/microvm.nix";
  inputs.microvm.inputs.nixpkgs.follows = "nixpkgs";
  outputs =   self, impermanence, microvm, nixpkgs  :
    let
      persistencePath = "/persistent";
      system = "x86_64-linux";
      user = "thk";
      vmname = "haskell";
      nixosConfiguration = nixpkgs.lib.nixosSystem  
          inherit system;
          modules = [
            microvm.nixosModules.microvm
            impermanence.nixosModules.impermanence
            ( pkgs, ...  :  
            environment.persistence.$ persistencePath  =  
                hideMounts = true;
                users.$ user  =  
                  directories = [
                    "git" ".stack"
                  ];
                 ;
               ;
              environment.sessionVariables =  
                TERM = "screen-256color";
               ;
              environment.systemPackages = with pkgs; [
                ghc
                git
                (haskell-language-server.override   supportedGhcVersions = [ "94" ];  )
                htop
                stack
                tmux
                tree
                vcsh
                zsh
              ];
              fileSystems.$ persistencePath .neededForBoot = nixpkgs.lib.mkForce true;
              microvm =  
                forwardPorts = [
                    from = "host"; host.port = 2222; guest.port = 22;  
                    from = "guest"; host.port = 5432; guest.port = 5432;   # postgresql
                ];
                hypervisor = "qemu";
                interfaces = [
                    type = "user"; id = "usernet"; mac = "00:00:00:00:00:02";  
                ];
                mem = 4096;
                shares = [  
                  # use "virtiofs" for MicroVMs that are started by systemd
                  proto = "9p";
                  tag = "ro-store";
                  # a host's /nix/store will be picked up so that no
                  # squashfs/erofs will be built for it.
                  source = "/nix/store";
                  mountPoint = "/nix/.ro-store";
                   
                  proto = "virtiofs";
                  tag = "persistent";
                  source = "~/.local/share/microvm/vms/$ vmname /persistent";
                  mountPoint = persistencePath;
                  socket = "/run/user/1000/microvm-$ vmname -persistent";
                 
                ];
                socket = "/run/user/1000/microvm-control.socket";
                vcpu = 3;
                volumes = [];
                writableStoreOverlay = "/nix/.rwstore";
               ;
              networking.hostName = vmname;
              nix.enable = true;
              nix.nixPath = ["nixpkgs=$ builtins.storePath <nixpkgs> "];
              nix.settings =  
                extra-experimental-features = ["nix-command" "flakes"];
                trusted-users = [user];
               ;
              security.sudo =  
                enable = true;
                wheelNeedsPassword = false;
               ;
              services.getty.autologinUser = user;
              services.openssh =  
                enable = true;
               ;
              system.stateVersion = "24.11";
              systemd.services.loadnixdb =  
                description = "import hosts nix database";
                path = [pkgs.nix];
                wantedBy = ["multi-user.target"];
                requires = ["nix-daemon.service"];
                script = "cat $ persistencePath /nix-store-db-dump nix-store --load-db";
               ;
              time.timeZone = nixpkgs.lib.mkDefault "Europe/Berlin";
              users.users.$ user  =  
                extraGroups = [ "wheel" "video" ];
                group = "user";
                isNormalUser = true;
                openssh.authorizedKeys.keys = [
                  "ssh-rsa REDACTED"
                ];
                password = "";
               ;
              users.users.root.password = "";
              users.groups.user =  ;
             )
          ];
         ;
    in  
      packages.$ system .default = nixosConfiguration.config.microvm.declaredRunner;
     ;
 
I start the microVM with a templated systemd user service:
[Unit]
Description=MicroVM for Haskell development
Requires=microvm-virtiofsd-persistent@.service
After=microvm-virtiofsd-persistent@.service
AssertFileNotEmpty=%h/.local/share/microvm/vms/%i/flake/flake.nix
[Service]
Type=forking
ExecStartPre=/usr/bin/sh -c "[ /nix/var/nix/db/db.sqlite -ot %h/.local/share/microvm/nix-store-db-dump ]   nix-store --dump-db >%h/.local/share/microvm/nix-store-db-dump"
ExecStartPre=ln -f -t %h/.local/share/microvm/vms/%i/persistent/ %h/.local/share/microvm/nix-store-db-dump
ExecStartPre=-%h/.local/state/nix/profile/bin/tmux new -s microvm -d
ExecStart=%h/.local/state/nix/profile/bin/tmux new-window -t microvm: -n "%i" "exec %h/.local/state/nix/profile/bin/nix run --impure %h/.local/share/microvm/vms/%i/flake"
The above service definition creates a dump of the hosts nix store db so that it can be imported in the guest. This is necessary so that the guest can actually use what is available in /nix/store. There is an effort for an overlayed nix store that would be preferable to this hack. Finally the microvm is started inside a tmux session named microvm . This way I can use the VM with SSH or through the console and also access the qemu console. And for completeness the virtiofsd service:
[Unit]
Description=serve host persistent folder for dev VM
AssertPathIsDirectory=%h/.local/share/microvm/vms/%i/persistent
[Service]
ExecStart=%h/.local/state/nix/profile/bin/virtiofsd \
 --socket-path=$ XDG_RUNTIME_DIR /microvm-%i-persistent \
 --shared-dir=%h/.local/share/microvm/vms/%i/persistent \
 --gid-map :995:%G:1: \
 --uid-map :1000:%U:1:

13 April 2024

Paul Tagliamonte: Domo Arigato, Mr. debugfs

Years ago, at what I think I remember was DebConf 15, I hacked for a while on debhelper to write build-ids to debian binary control files, so that the build-id (more specifically, the ELF note .note.gnu.build-id) wound up in the Debian apt archive metadata. I ve always thought this was super cool, and seeing as how Michael Stapelberg blogged some great pointers around the ecosystem, including the fancy new debuginfod service, and the find-dbgsym-packages helper, which uses these same headers, I don t think I m the only one. At work I ve been using a lot of rust, specifically, async rust using tokio. To try and work on my style, and to dig deeper into the how and why of the decisions made in these frameworks, I ve decided to hack up a project that I ve wanted to do ever since 2015 write a debug filesystem. Let s get to it.

Back to the Future Time to admit something. I really love Plan 9. It s just so good. So many ideas from Plan 9 are just so prescient, and everything just feels right. Not just right like, feels good like, correct. The bit that I ve always liked the most is 9p, the network protocol for serving a filesystem over a network. This leads to all sorts of fun programs, like the Plan 9 ftp client being a 9p server you mount the ftp server and access files like any other files. It s kinda like if fuse were more fully a part of how the operating system worked, but fuse is all running client-side. With 9p there s a single client, and different servers that you can connect to, which may be backed by a hard drive, remote resources over something like SFTP, FTP, HTTP or even purely synthetic. The interesting (maybe sad?) part here is that 9p wound up outliving Plan 9 in terms of adoption 9p is in all sorts of places folks don t usually expect. For instance, the Windows Subsystem for Linux uses the 9p protocol to share files between Windows and Linux. ChromeOS uses it to share files with Crostini, and qemu uses 9p (virtio-p9) to share files between guest and host. If you re noticing a pattern here, you d be right; for some reason 9p is the go-to protocol to exchange files between hypervisor and guest. Why? I have no idea, except maybe due to being designed well, simple to implement, and it s a lot easier to validate the data being shared and validate security boundaries. Simplicity has its value. As a result, there s a lot of lingering 9p support kicking around. Turns out Linux can even handle mounting 9p filesystems out of the box. This means that I can deploy a filesystem to my LAN or my localhost by running a process on top of a computer that needs nothing special, and mount it over the network on an unmodified machine unlike fuse, where you d need client-specific software to run in order to mount the directory. For instance, let s mount a 9p filesystem running on my localhost machine, serving requests on 127.0.0.1:564 (tcp) that goes by the name mountpointname to /mnt.
$ mount -t 9p \
-o trans=tcp,port=564,version=9p2000.u,aname=mountpointname \
127.0.0.1 \
/mnt
Linux will mount away, and attach to the filesystem as the root user, and by default, attach to that mountpoint again for each local user that attempts to use it. Nifty, right? I think so. The server is able to keep track of per-user access and authorization along with the host OS.

WHEREIN I STYX WITH IT Since I wanted to push myself a bit more with rust and tokio specifically, I opted to implement the whole stack myself, without third party libraries on the critical path where I could avoid it. The 9p protocol (sometimes called Styx, the original name for it) is incredibly simple. It s a series of client to server requests, which receive a server to client response. These are, respectively, T messages, which transmit a request to the server, which trigger an R message in response (Reply messages). These messages are TLV payload with a very straight forward structure so straight forward, in fact, that I was able to implement a working server off nothing more than a handful of man pages. Later on after the basics worked, I found a more complete spec page that contains more information about the unix specific variant that I opted to use (9P2000.u rather than 9P2000) due to the level of Linux specific support for the 9P2000.u variant over the 9P2000 protocol.

MR ROBOTO The backend stack over at zoo is rust and tokio running i/o for an HTTP and WebRTC server. I figured I d pick something fairly similar to write my filesystem with, since 9P can be implemented on basically anything with I/O. That means tokio tcp server bits, which construct and use a 9p server, which has an idiomatic Rusty API that partially abstracts the raw R and T messages, but not so much as to cause issues with hiding implementation possibilities. At each abstraction level, there s an escape hatch allowing someone to implement any of the layers if required. I called this framework arigato which can be found over on docs.rs and crates.io.
/// Simplified version of the arigato File trait; this isn't actually
/// the same trait; there's some small cosmetic differences. The
/// actual trait can be found at:
///
/// https://docs.rs/arigato/latest/arigato/server/trait.File.html
trait File  
/// OpenFile is the type returned by this File via an Open call.
 type OpenFile: OpenFile;
/// Return the 9p Qid for this file. A file is the same if the Qid is
 /// the same. A Qid contains information about the mode of the file,
 /// version of the file, and a unique 64 bit identifier.
 fn qid(&self) -> Qid;
/// Construct the 9p Stat struct with metadata about a file.
 async fn stat(&self) -> FileResult<Stat>;
/// Attempt to update the file metadata.
 async fn wstat(&mut self, s: &Stat) -> FileResult<()>;
/// Traverse the filesystem tree.
 async fn walk(&self, path: &[&str]) -> FileResult<(Option<Self>, Vec<Self>)>;
/// Request that a file's reference be removed from the file tree.
 async fn unlink(&mut self) -> FileResult<()>;
/// Create a file at a specific location in the file tree.
 async fn create(
&mut self,
name: &str,
perm: u16,
ty: FileType,
mode: OpenMode,
extension: &str,
) -> FileResult<Self>;
/// Open the File, returning a handle to the open file, which handles
 /// file i/o. This is split into a second type since it is genuinely
 /// unrelated -- and the fact that a file is Open or Closed can be
 /// handled by the  arigato  server for us.
 async fn open(&mut self, mode: OpenMode) -> FileResult<Self::OpenFile>;
 
/// Simplified version of the arigato OpenFile trait; this isn't actually
/// the same trait; there's some small cosmetic differences. The
/// actual trait can be found at:
///
/// https://docs.rs/arigato/latest/arigato/server/trait.OpenFile.html
trait OpenFile  
/// iounit to report for this file. The iounit reported is used for Read
 /// or Write operations to signal, if non-zero, the maximum size that is
 /// guaranteed to be transferred atomically.
 fn iounit(&self) -> u32;
/// Read some number of bytes up to  buf.len()  from the provided
 ///  offset  of the underlying file. The number of bytes read is
 /// returned.
 async fn read_at(
&mut self,
buf: &mut [u8],
offset: u64,
) -> FileResult<u32>;
/// Write some number of bytes up to  buf.len()  from the provided
 ///  offset  of the underlying file. The number of bytes written
 /// is returned.
 fn write_at(
&mut self,
buf: &mut [u8],
offset: u64,
) -> FileResult<u32>;
 

Thanks, decade ago paultag! Let s do it! Let s use arigato to implement a 9p filesystem we ll call debugfs that will serve all the debug files shipped according to the Packages metadata from the apt archive. We ll fetch the Packages file and construct a filesystem based on the reported Build-Id entries. For those who don t know much about how an apt repo works, here s the 2-second crash course on what we re doing. The first is to fetch the Packages file, which is specific to a binary architecture (such as amd64, arm64 or riscv64). That architecture is specific to a component (such as main, contrib or non-free). That component is specific to a suite, such as stable, unstable or any of its aliases (bullseye, bookworm, etc). Let s take a look at the Packages.xz file for the unstable-debug suite, main component, for all amd64 binaries.
$ curl \
https://deb.debian.org/debian-debug/dists/unstable-debug/main/binary-amd64/Packages.xz \
  unxz
This will return the Debian-style rfc2822-like headers, which is an export of the metadata contained inside each .deb file which apt (or other tools that can use the apt repo format) use to fetch information about debs. Let s take a look at the debug headers for the netlabel-tools package in unstable which is a package named netlabel-tools-dbgsym in unstable-debug.
Package: netlabel-tools-dbgsym
Source: netlabel-tools (0.30.0-1)
Version: 0.30.0-1+b1
Installed-Size: 79
Maintainer: Paul Tagliamonte <paultag@debian.org>
Architecture: amd64
Depends: netlabel-tools (= 0.30.0-1+b1)
Description: debug symbols for netlabel-tools
Auto-Built-Package: debug-symbols
Build-Ids: e59f81f6573dadd5d95a6e4474d9388ab2777e2a
Description-md5: a0e587a0cf730c88a4010f78562e6db7
Section: debug
Priority: optional
Filename: pool/main/n/netlabel-tools/netlabel-tools-dbgsym_0.30.0-1+b1_amd64.deb
Size: 62776
SHA256: 0e9bdb087617f0350995a84fb9aa84541bc4df45c6cd717f2157aa83711d0c60
So here, we can parse the package headers in the Packages.xz file, and store, for each Build-Id, the Filename where we can fetch the .deb at. Each .deb contains a number of files but we re only really interested in the files inside the .deb located at or under /usr/lib/debug/.build-id/, which you can find in debugfs under rfc822.rs. It s crude, and very single-purpose, but I m feeling a bit lazy.

Who needs dpkg?! For folks who haven t seen it yet, a .deb file is a special type of .ar file, that contains (usually) three files inside debian-binary, control.tar.xz and data.tar.xz. The core of an .ar file is a fixed size (60 byte) entry header, followed by the specified size number of bytes.
[8 byte .ar file magic]
[60 byte entry header]
[N bytes of data]
[60 byte entry header]
[N bytes of data]
[60 byte entry header]
[N bytes of data]
...
First up was to implement a basic ar parser in ar.rs. Before we get into using it to parse a deb, as a quick diversion, let s break apart a .deb file by hand something that is a bit of a rite of passage (or at least it used to be? I m getting old) during the Debian nm (new member) process, to take a look at where exactly the .debug file lives inside the .deb file.
$ ar x netlabel-tools-dbgsym_0.30.0-1+b1_amd64.deb
$ ls
control.tar.xz debian-binary
data.tar.xz netlabel-tools-dbgsym_0.30.0-1+b1_amd64.deb
$ tar --list -f data.tar.xz   grep '.debug$'
./usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug
Since we know quite a bit about the structure of a .deb file, and I had to implement support from scratch anyway, I opted to implement a (very!) basic debfile parser using HTTP Range requests. HTTP Range requests, if supported by the server (denoted by a accept-ranges: bytes HTTP header in response to an HTTP HEAD request to that file) means that we can add a header such as range: bytes=8-68 to specifically request that the returned GET body be the byte range provided (in the above case, the bytes starting from byte offset 8 until byte offset 68). This means we can fetch just the ar file entry from the .deb file until we get to the file inside the .deb we are interested in (in our case, the data.tar.xz file) at which point we can request the body of that file with a final range request. I wound up writing a struct to handle a read_at-style API surface in hrange.rs, which we can pair with ar.rs above and start to find our data in the .deb remotely without downloading and unpacking the .deb at all. After we have the body of the data.tar.xz coming back through the HTTP response, we get to pipe it through an xz decompressor (this kinda sucked in Rust, since a tokio AsyncRead is not the same as an http Body response is not the same as std::io::Read, is not the same as an async (or sync) Iterator is not the same as what the xz2 crate expects; leading me to read blocks of data to a buffer and stuff them through the decoder by looping over the buffer for each lzma2 packet in a loop), and tarfile parser (similarly troublesome). From there we get to iterate over all entries in the tarfile, stopping when we reach our file of interest. Since we can t seek, but gdb needs to, we ll pull it out of the stream into a Cursor<Vec<u8>> in-memory and pass a handle to it back to the user. From here on out its a matter of gluing together a File traited struct in debugfs, and serving the filesystem over TCP using arigato. Done deal!

A quick diversion about compression I was originally hoping to avoid transferring the whole tar file over the network (and therefore also reading the whole debug file into ram, which objectively sucks), but quickly hit issues with figuring out a way around seeking around an xz file. What s interesting is xz has a great primitive to solve this specific problem (specifically, use a block size that allows you to seek to the block as close to your desired seek position just before it, only discarding at most block size - 1 bytes), but data.tar.xz files generated by dpkg appear to have a single mega-huge block for the whole file. I don t know why I would have expected any different, in retrospect. That means that this now devolves into the base case of How do I seek around an lzma2 compressed data stream ; which is a lot more complex of a question. Thankfully, notoriously brilliant tianon was nice enough to introduce me to Jon Johnson who did something super similar adapted a technique to seek inside a compressed gzip file, which lets his service oci.dag.dev seek through Docker container images super fast based on some prior work such as soci-snapshotter, gztool, and zran.c. He also pulled this party trick off for apk based distros over at apk.dag.dev, which seems apropos. Jon was nice enough to publish a lot of his work on this specifically in a central place under the name targz on his GitHub, which has been a ton of fun to read through. The gist is that, by dumping the decompressor s state (window of previous bytes, in-memory data derived from the last N-1 bytes) at specific checkpoints along with the compressed data stream offset in bytes and decompressed offset in bytes, one can seek to that checkpoint in the compressed stream and pick up where you left off creating a similar block mechanism against the wishes of gzip. It means you d need to do an O(n) run over the file, but every request after that will be sped up according to the number of checkpoints you ve taken. Given the complexity of xz and lzma2, I don t think this is possible for me at the moment especially given most of the files I ll be requesting will not be loaded from again especially when I can just cache the debug header by Build-Id. I want to implement this (because I m generally curious and Jon has a way of getting someone excited about compression schemes, which is not a sentence I thought I d ever say out loud), but for now I m going to move on without this optimization. Such a shame, since it kills a lot of the work that went into seeking around the .deb file in the first place, given the debian-binary and control.tar.gz members are so small.

The Good First, the good news right? It works! That s pretty cool. I m positive my younger self would be amused and happy to see this working; as is current day paultag. Let s take debugfs out for a spin! First, we need to mount the filesystem. It even works on an entirely unmodified, stock Debian box on my LAN, which is huge. Let s take it for a spin:
$ mount \
-t 9p \
-o trans=tcp,version=9p2000.u,aname=unstable-debug \
192.168.0.2 \
/usr/lib/debug/.build-id/
And, let s prove to ourselves that this actually mounted before we go trying to use it:
$ mount   grep build-id
192.168.0.2 on /usr/lib/debug/.build-id type 9p (rw,relatime,aname=unstable-debug,access=user,trans=tcp,version=9p2000.u,port=564)
Slick. We ve got an open connection to the server, where our host will keep a connection alive as root, attached to the filesystem provided in aname. Let s take a look at it.
$ ls /usr/lib/debug/.build-id/
00 0d 1a 27 34 41 4e 5b 68 75 82 8E 9b a8 b5 c2 CE db e7 f3
01 0e 1b 28 35 42 4f 5c 69 76 83 8f 9c a9 b6 c3 cf dc E7 f4
02 0f 1c 29 36 43 50 5d 6a 77 84 90 9d aa b7 c4 d0 dd e8 f5
03 10 1d 2a 37 44 51 5e 6b 78 85 91 9e ab b8 c5 d1 de e9 f6
04 11 1e 2b 38 45 52 5f 6c 79 86 92 9f ac b9 c6 d2 df ea f7
05 12 1f 2c 39 46 53 60 6d 7a 87 93 a0 ad ba c7 d3 e0 eb f8
06 13 20 2d 3a 47 54 61 6e 7b 88 94 a1 ae bb c8 d4 e1 ec f9
07 14 21 2e 3b 48 55 62 6f 7c 89 95 a2 af bc c9 d5 e2 ed fa
08 15 22 2f 3c 49 56 63 70 7d 8a 96 a3 b0 bd ca d6 e3 ee fb
09 16 23 30 3d 4a 57 64 71 7e 8b 97 a4 b1 be cb d7 e4 ef fc
0a 17 24 31 3e 4b 58 65 72 7f 8c 98 a5 b2 bf cc d8 E4 f0 fd
0b 18 25 32 3f 4c 59 66 73 80 8d 99 a6 b3 c0 cd d9 e5 f1 fe
0c 19 26 33 40 4d 5a 67 74 81 8e 9a a7 b4 c1 ce da e6 f2 ff
Outstanding. Let s try using gdb to debug a binary that was provided by the Debian archive, and see if it ll load the ELF by build-id from the right .deb in the unstable-debug suite:
$ gdb -q /usr/sbin/netlabelctl
Reading symbols from /usr/sbin/netlabelctl...
Reading symbols from /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug...
(gdb)
Yes! Yes it will!
$ file /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug
/usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter *empty*, BuildID[sha1]=e59f81f6573dadd5d95a6e4474d9388ab2777e2a, for GNU/Linux 3.2.0, with debug_info, not stripped

The Bad Linux s support for 9p is mainline, which is great, but it s not robust. Network issues or server restarts will wedge the mountpoint (Linux can t reconnect when the tcp connection breaks), and things that work fine on local filesystems get translated in a way that causes a lot of network chatter for instance, just due to the way the syscalls are translated, doing an ls, will result in a stat call for each file in the directory, even though linux had just got a stat entry for every file while it was resolving directory names. On top of that, Linux will serialize all I/O with the server, so there s no concurrent requests for file information, writes, or reads pending at the same time to the server; and read and write throughput will degrade as latency increases due to increasing round-trip time, even though there are offsets included in the read and write calls. It works well enough, but is frustrating to run up against, since there s not a lot you can do server-side to help with this beyond implementing the 9P2000.L variant (which, maybe is worth it).

The Ugly Unfortunately, we don t know the file size(s) until we ve actually opened the underlying tar file and found the correct member, so for most files, we don t know the real size to report when getting a stat. We can t parse the tarfiles for every stat call, since that d make ls even slower (bummer). Only hiccup is that when I report a filesize of zero, gdb throws a bit of a fit; let s try with a size of 0 to start:
$ ls -lah /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug
-r--r--r-- 1 root root 0 Dec 31 1969 /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug
$ gdb -q /usr/sbin/netlabelctl
Reading symbols from /usr/sbin/netlabelctl...
Reading symbols from /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug...
warning: Discarding section .note.gnu.build-id which has a section size (24) larger than the file size [in module /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug]
[...]
This obviously won t work since gdb will throw away all our hard work because of stat s output, and neither will loading the real size of the underlying file. That only leaves us with hardcoding a file size and hope nothing else breaks significantly as a result. Let s try it again:
$ ls -lah /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug
-r--r--r-- 1 root root 954M Dec 31 1969 /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug
$ gdb -q /usr/sbin/netlabelctl
Reading symbols from /usr/sbin/netlabelctl...
Reading symbols from /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug...
(gdb)
Much better. I mean, terrible but better. Better for now, anyway.

Kilroy was here Do I think this is a particularly good idea? I mean; kinda. I m probably going to make some fun 9p arigato-based filesystems for use around my LAN, but I don t think I ll be moving to use debugfs until I can figure out how to ensure the connection is more resilient to changing networks, server restarts and fixes on i/o performance. I think it was a useful exercise and is a pretty great hack, but I don t think this ll be shipping anywhere anytime soon. Along with me publishing this post, I ve pushed up all my repos; so you should be able to play along at home! There s a lot more work to be done on arigato; but it does handshake and successfully export a working 9P2000.u filesystem. Check it out on on my github at arigato, debugfs and also on crates.io and docs.rs. At least I can say I was here and I got it working after all these years.

12 April 2024

NOKUBI Takatsugu: mailman3-web error when upgrading to bookworm

I tried to upgrade bullseye machien to bookworm, so I got the following error:
File /usr/lib/python3/dist-packages/django/contrib/auth/mixins.py , line 5, in
from django.contrib.auth.views import redirect_to_login
File /usr/lib/python3/dist-packages/django/contrib/auth/views.py , line 20, in
from django.utils.http import (
ImportError: cannot import name url_has_allowed_host_and_scheme from django.utils.http (/usr/lib/python3/dist-packages/django/utils/http.py) During handling of the above exception, another exception occurred:
It is similar to #1000810, but it is already closed. My solution is: I tried to send to the report, but it rerutns 550 Unknown or archived bug

9 April 2024

Matthew Palmer: How I Tripped Over the Debian Weak Keys Vulnerability

Those of you who haven t been in IT for far, far too long might not know that next month will be the 16th(!) anniversary of the disclosure of what was, at the time, a fairly earth-shattering revelation: that for about 18 months, the Debian OpenSSL package was generating entirely predictable private keys. The recent xz-stential threat (thanks to @nixCraft for making me aware of that one), has got me thinking about my own serendipitous interaction with a major vulnerability. Given that the statute of limitations has (probably) run out, I thought I d share it as a tale of how huh, that s weird can be a powerful threat-hunting tool but only if you ve got the time to keep pulling at the thread.

Prelude to an Adventure Our story begins back in March 2008. I was working at Engine Yard (EY), a now largely-forgotten Rails-focused hosting company, which pioneered several advances in Rails application deployment. Probably EY s greatest claim to lasting fame is that they helped launch a little code hosting platform you might have heard of, by providing them free infrastructure when they were little more than a glimmer in the Internet s eye. I am, of course, talking about everyone s favourite Microsoft product: GitHub. Since GitHub was in the right place, at the right time, with a compelling product offering, they quickly started to gain traction, and grow their userbase. With growth comes challenges, amongst them the one we re focusing on today: SSH login times. Then, as now, GitHub provided SSH access to the git repos they hosted, by SSHing to git@github.com with publickey authentication. They were using the standard way that everyone manages SSH keys: the ~/.ssh/authorized_keys file, and that became a problem as the number of keys started to grow. The way that SSH uses this file is that, when a user connects and asks for publickey authentication, SSH opens the ~/.ssh/authorized_keys file and scans all of the keys listed in it, looking for a key which matches the key that the user presented. This linear search is normally not a huge problem, because nobody in their right mind puts more than a few keys in their ~/.ssh/authorized_keys, right?
2008-era GitHub giving monkey puppet side-eye to the idea that nobody stores many keys in an authorized_keys file
Of course, as a popular, rapidly-growing service, GitHub was gaining users at a fair clip, to the point that the one big file that stored all the SSH keys was starting to visibly impact SSH login times. This problem was also not going to get any better by itself. Something Had To Be Done. EY management was keen on making sure GitHub ran well, and so despite it not really being a hosting problem, they were willing to help fix this problem. For some reason, the late, great, Ezra Zygmuntowitz pointed GitHub in my direction, and let me take the time to really get into the problem with the GitHub team. After examining a variety of different possible solutions, we came to the conclusion that the least-worst option was to patch OpenSSH to lookup keys in a MySQL database, indexed on the key fingerprint. We didn t take this decision on a whim it wasn t a case of yeah, sure, let s just hack around with OpenSSH, what could possibly go wrong? . We knew it was potentially catastrophic if things went sideways, so you can imagine how much worse the other options available were. Ensuring that this wouldn t compromise security was a lot of the effort that went into the change. In the end, though, we rolled it out in early April, and lo! SSH logins were fast, and we were pretty sure we wouldn t have to worry about this problem for a long time to come. Normally, you d think patching OpenSSH to make mass SSH logins super fast would be a good story on its own. But no, this is just the opening scene.

Chekov s Gun Makes its Appearance Fast forward a little under a month, to the first few days of May 2008. I get a message from one of the GitHub team, saying that somehow users were able to access other users repos over SSH. Naturally, as we d recently rolled out the OpenSSH patch, which touched this very thing, the code I d written was suspect number one, so I was called in to help.
The lineup scene from the movie The Usual Suspects They're called The Usual Suspects for a reason, but sometimes, it really is Keyser S ze
Eventually, after more than a little debugging, we discovered that, somehow, there were two users with keys that had the same key fingerprint. This absolutely shouldn t happen it s a bit like winning the lottery twice in a row1 unless the users had somehow shared their keys with each other, of course. Still, it was worth investigating, just in case it was a web application bug, so the GitHub team reached out to the users impacted, to try and figure out what was going on. The users professed no knowledge of each other, neither admitted to publicising their key, and couldn t offer any explanation as to how the other person could possibly have gotten their key. Then things went from weird to what the ? . Because another pair of users showed up, sharing a key fingerprint but it was a different shared key fingerprint. The odds now have gone from winning the lottery multiple times in a row to as close to this literally cannot happen as makes no difference.
Milhouse from The Simpsons says that We're Through The Looking Glass Here, People
Once we were really, really confident that the OpenSSH patch wasn t the cause of the problem, my involvement in the problem basically ended. I wasn t a GitHub employee, and EY had plenty of other customers who needed my help, so I wasn t able to stay deeply involved in the on-going investigation of The Mystery of the Duplicate Keys. However, the GitHub team did keep talking to the users involved, and managed to determine the only apparent common factor was that all the users claimed to be using Debian or Ubuntu systems, which was where their SSH keys would have been generated. That was as far as the investigation had really gotten, when along came May 13, 2008.

Chekov s Gun Goes Off With the publication of DSA-1571-1, everything suddenly became clear. Through a well-meaning but ultimately disasterous cleanup of OpenSSL s randomness generation code, the Debian maintainer had inadvertently reduced the number of possible keys that could be generated by a given user from bazillions to a little over 32,000. With so many people signing up to GitHub some of them no doubt following best practice and freshly generating a separate key it s unsurprising that some collisions occurred. You can imagine the sense of oooooooh, so that s what s going on! that rippled out once the issue was understood. I was mostly glad that we had conclusive evidence that my OpenSSH patch wasn t at fault, little knowing how much more contact I was to have with Debian weak keys in the future, running a huge store of known-compromised keys and using them to find misbehaving Certificate Authorities, amongst other things.

Lessons Learned While I ve not found a description of exactly when and how Luciano Bello discovered the vulnerability that became CVE-2008-0166, I presume he first came across it some time before it was disclosed likely before GitHub tripped over it. The stable Debian release that included the vulnerable code had been released a year earlier, so there was plenty of time for Luciano to have discovered key collisions and go hmm, I wonder what s going on here? , then keep digging until the solution presented itself. The thought hmm, that s odd , followed by intense investigation, leading to the discovery of a major flaw is also what ultimately brought down the recent XZ backdoor. The critical part of that sequence is the ability to do that intense investigation, though. When I reflect on my brush with the Debian weak keys vulnerability, what sticks out to me is the fact that I didn t do the deep investigation. I wonder if Luciano hadn t found it, how long it might have been before it was found. The GitHub team would have continued investigating, presumably, and perhaps they (or I) would have eventually dug deep enough to find it. But we were all super busy myself, working support tickets at EY, and GitHub feverishly building features and fighting the fires in their rapidly-growing service. As it was, Luciano was able to take the time to dig in and find out what was happening, but just like the XZ backdoor, I feel like we, as an industry, got a bit lucky that someone with the skills, time, and energy was on hand at the right time to make a huge difference. It s a luxury to be able to take the time to really dig into a problem, and it s a luxury that most of us rarely have. Perhaps an understated takeaway is that somehow we all need to wrestle back some time to follow our hunches and really dig into the things that make us go hmm .

Support My Hunches If you d like to help me be able to do intense investigations of mysterious software phenomena, you can shout me a refreshing beverage on ko-fi.
  1. the odds are actually probably more like winning the lottery about twenty times in a row. The numbers involved are staggeringly huge, so it s easiest to just approximate it as really, really unlikely .

8 April 2024

Bastian Blank: Python dataclasses for Deb822 format

Python includes some helping support for classes that are designed to just hold some data and not much more: Data Classes. It uses plain Python type definitions to specify what you can have and some further information for every field. This will then generate you some useful methods, like __init__ and __repr__, but on request also more. But given that those type definitions are available to other code, a lot more can be done. There exists several separate packages to work on data classes. For example you can have data validation from JSON with dacite. But Debian likes a pretty strange format usually called Deb822, which is in fact derived from the RFC 822 format of e-mail messages. Those files includes single messages with a well known format. So I'd like to introduce some Deb822 format support for Python Data Classes. For now the code resides in the Debian Cloud tool. Usage Setup It uses the standard data classes support and several helper functions. Also you need to enable support for postponed evaluation of annotations.
from __future__ import annotations
from dataclasses import dataclass
from dataclasses_deb822 import read_deb822, field_deb822
Class definition start Data classes are just normal classes, just with a decorator.
@dataclass
class Package:
Field definitions You need to specify the exact key to be used for this field.
    package: str = field_deb822('Package')
    version: str = field_deb822('Version')
    arch: str = field_deb822('Architecture')
Default values are also supported.
    multi_arch: Optional[str] = field_deb822(
        'Multi-Arch',
        default=None,
    )
Reading files
for p in read_deb822(Package, sys.stdin, ignore_unknown=True):
    print(p)
Full example
from __future__ import annotations
from dataclasses import dataclass
from debian_cloud_images.utils.dataclasses_deb822 import read_deb822, field_deb822
from typing import Optional
import sys
@dataclass
class Package:
    package: str = field_deb822('Package')
    version: str = field_deb822('Version')
    arch: str = field_deb822('Architecture')
    multi_arch: Optional[str] = field_deb822(
        'Multi-Arch',
        default=None,
    )
for p in read_deb822(Package, sys.stdin, ignore_unknown=True):
    print(p)
Known limitations

5 April 2024

Emanuele Rocca: PGP keys on Yubikey, with a side of Mutt

Here are my notes about copying PGP keys to external hardware devices such as Yubikeys. Let me begin by saying that the gpg tools are pretty bad at this.
MAKE A COUPLE OF BACKUPS OF ~/.gnupg/ TO DIFFERENT ENCRYPTED USB STICKS BEFORE YOU START. GPG WILL MESS UP YOUR KEYS. SERIOUSLY.
For example, would you believe me if I said that saving changes results in the removal of your private key? Well check this out.
Now that you have multiple safe, offline backups of your keys, here are my notes.
apt install yubikey-manager scdaemon
Plug the Yubikey in, see if it s recognized properly:
ykman list
gpg --card-status
Change the default PIN (123456) and Admin PIN (12345678):
gpg --card-edit
gpg/card> admin
gpg/card> passwd
Look at the openpgp information and change the maximum number of retries, if you like. I have seen this failing a couple of times, unplugging the Yubikey and putting it back in worked.
ykman openpgp info
ykman openpgp access set-retries 7 7 7
Copy your keys. MAKE A BACKUP OF ~/.gnupg/ BEFORE YOU DO THIS.
gpg --edit-key $KEY_ID
gpg> keytocard # follow the prompts to copy the first key
Now choose the next key and copy that one too. Repeat till all subkeys are copied.
gpg> key 1
gpg> keytocard
Typing gpg --card-status you should be able to see all your keys on the Yubikey now.

Using the key on another machine
How do you use your PGP keys on the Yubikey on other systems?
Go to another system, if it does have a ~/.gnupg directory already move it somewhere else.
apt install scdaemon
Import your public key:
gpg -k
gpg --keyserver pgp.mit.edu --recv-keys $KEY_ID
Check the fingerprint and if it is indeed your key say you trust it:
gpg --edit-key $KEY_ID
> trust
> 5
> y
> save
Now try gpg --card-status and gpg --list-secret-keys, you should be able to see your keys. Try signing something, it should work.
gpg --output /tmp/x.out --sign /etc/motd
gpg --verify /tmp/x.out

Using the Yubikey with Mutt
If you re using mutt with IMAP, there is a very simple trick to safely store your password on disk. Create an encrypted file with your IMAP password:
echo SUPERSECRET   gpg --encrypt > ~/.mutt_password.gpg
Add the following to ~/.muttrc:
set imap_pass= gpg --decrypt ~/.mutt_password.gpg 
With the above, mutt now prompts you to insert the Yubikey and type your PIN in order to connect to the IMAP server.

3 April 2024

Arnaud Rebillout: Firefox: Moving from the Debian package to the Flatpak app (long-term?)

First, thanks to Samuel Henrique for giving notice of recent Firefox CVEs in Debian testing/unstable. At the time I didn't want to upgrade my system (Debian Sid) due to the ongoing t64 transition transition, so I decided I could install the Firefox Flatpak app instead, and why not stick to it long-term? This blog post details all the steps, if ever others want to go the same road. Flatpak Installation Disclaimer: this section is hardly anything more than a copy/paste of the official documentation, and with time it will get outdated, so you'd better follow the official doc. First thing first, let's install Flatpak:
$ sudo apt update
$ sudo apt install flatpak
Then the next step is to add the Flathub remote repository, from where we'll get our Flatpak applications:
$ flatpak remote-add --if-not-exists flathub https://dl.flathub.org/repo/flathub.flatpakrepo
And that's all there is to it! Now come the optional steps. For GNOME and KDE users, you might want to install a plugin for the software manager specific to your desktop, so that it can support and manage Flatpak apps:
$ which -s gnome-software  && sudo apt install gnome-software-plugin-flatpak
$ which -s plasma-discover && sudo apt install plasma-discover-backend-flatpak
And here's an additional check you can do, as it's something that did bite me in the past: missing xdg-portal-* packages, that are required for Flatpak applications to communicate with the desktop environment. Just to be sure, you can check the output of apt search '^xdg-desktop-portal' to see what's available, and compare with the output of dpkg -l grep xdg-desktop-portal. As you can see, if you're a GNOME or KDE user, there's a portal backend for you, and it should be installed. For reference, this is what I have on my GNOME desktop at the moment:
$ dpkg -l   grep xdg-desktop-portal   awk ' print $2 '
xdg-desktop-portal
xdg-desktop-portal-gnome
xdg-desktop-portal-gtk
Install the Firefox Flatpak app This is trivial, but still, there's a question I've always asked myself: should I install applications system-wide (aka. flatpak --system, the default) or per-user (aka. flatpak --user)? Turns out, this questions is answered in the Flatpak documentation:
Flatpak commands are run system-wide by default. If you are installing applications for day-to-day usage, it is recommended to stick with this default behavior.
Armed with this new knowledge, let's install the Firefox app:
$ flatpak install flathub org.mozilla.firefox
And that's about it! We can give it a go already:
$ flatpak run org.mozilla.firefox
Data migration At this point, running Firefox via Flatpak gives me an "empty" Firefox. That's not what I want, instead I want my usual Firefox, with a gazillion of tabs already opened, a few extensions, bookmarks and so on. As it turns out, Mozilla provides a brief doc for data migration, and it's as simple as moving Firefox data directory around! To clarify, we'll be copying data: Make sure that all Firefox instances are closed, then proceed:
# BEWARE! Below I'm erasing data!
$ rm -fr ~/.var/app/org.mozilla.firefox/.mozilla/firefox/
$ cp -a ~/.mozilla/firefox/ ~/.var/app/org.mozilla.firefox/.mozilla/
To avoid confusing myself, it's also a good idea to rename the local data directory:
$ mv ~/.mozilla/firefox ~/.mozilla/firefox.old.$(date --iso-8601=date)
At this point, flatpak run org.mozilla.firefox takes me to my "usual" everyday Firefox, with all its tabs opened, pinned, bookmarked, etc. More integration? After following all the steps above, I must say that I'm 99% happy. So far, everything works as before, I didn't hit any issue, and I don't even notice that Firefox is running via Flatpak, it's completely transparent. So where's the 1% of unhappiness? The Run a Command dialog from GNOME, the one that shows up via the keyboard shortcut <Alt+F2>. This is how I start my GUI applications, and I usually run two Firefox instances in parallel (one for work, one for personal), using the firefox -p <profile> command. Given that I ran apt purge firefox before (to avoid confusing myself with two installations of Firefox), now the right (and only) way to start Firefox from a command-line is to type flatpak run org.mozilla.firefox -p <profile>. Typing that every time is way too cumbersome, so I need something quicker. Seems like the most straightforward is to create a wrapper script:
$ cat /usr/local/bin/firefox 
#!/bin/sh
exec flatpak run org.mozilla.firefox "$@"
And now I can just hit <Alt+F2> and type firefox -p <profile> to start Firefox with the profile I want, just as before. Neat! Looking forward: system updates I usually update my system manually every now and then, via the well-known pair of commands:
$ sudo apt update
$ sudo apt full-upgrade
The downside of introducing Flatpak, ie. introducing another package manager, is that I'll need to learn new commands to update the software that comes via this channel. Fortunately, there's really not much to learn. From flatpak-update(1):
flatpak update [OPTION...] [REF...] Updates applications and runtimes. [...] If no REF is given, everything is updated, as well as appstream info for all remotes.
Could it be that simple? Apparently yes, the Flatpak equivalent of the two apt commands above is just:
$ flatpak update
Going forward, my options are:
  1. Teach myself to run flatpak update additionally to apt update, manually, everytime I update my system.
  2. Go crazy: let something automatically update my Flatpak apps, in my back and without my consent.
I'm actually tempted to go for option 2 here, and I wonder if GNOME Software will do that for me, provided that I installed gnome-software-plugin-flatpak, and that I checked Software Updates -> Automatic in the Settings (which I did). However, I didn't find any documentation regarding what this setting really does, so I can't say if it will only download updates, or if it will also install it. I'd be happy if it automatically installs new version of Flatpak apps, but at the same time I'd be very unhappy if it automatically upgrades my Debian system... So we'll see. Enough for today, hope this blog post was useful!

1 April 2024

Simon Josefsson: Towards reproducible minimal source code tarballs? On *-src.tar.gz

While the work to analyze the xz backdoor is in progress, several ideas have been suggested to improve the software supply chain ecosystem. Some of those ideas are good, some of the ideas are at best irrelevant and harmless, and some suggestions are plain bad. I d like to attempt to formalize two ideas, which have been discussed before, but the context in which they can be appreciated have not been as clear as it is today.
  1. Reproducible tarballs. The idea is that published source tarballs should be possible to reproduce independently somehow, and that this should be continuously tested and verified preferrably as part of the upstream project continuous integration system (e.g., GitHub action or GitLab pipeline). While nominally this looks easy to achieve, there are some complex matters in this, for example: what timestamps to use for files in the tarball? I ve brought up this aspect before.
  2. Minimal source tarballs without generated vendor files. Most GNU Autoconf/Automake-based tarballs pre-generated files which are important for bootstrapping on exotic systems that does not have the required dependencies. For the bootstrapping story to succeed, this approach is important to support. However it has become clear that this practice raise significant costs and risks. Most modern GNU/Linux distributions have all the required dependencies and actually prefers to re-build everything from source code. These pre-generated extra files introduce uncertainty to that process.
My strawman proposal to improve things is to define new tarball format *-src.tar.gz with at least the following properties:
  1. The tarball should allow users to build the project, which is the entire purpose of all this. This means that at least all source code for the project has to be included.
  2. The tarballs should be signed, for example with PGP or minisign.
  3. The tarball should be possible to reproduce bit-by-bit by a third party using upstream s version controlled sources and a pointer to which revision was used (e.g., git tag or git commit).
  4. The tarball should not require an Internet connection to download things.
    • Corollary: every external dependency either has to be explicitly documented as such (e.g., gcc and GnuTLS), or included in the tarball.
    • Observation: This means including all *.po gettext translations which are normally downloaded when building from version controlled sources.
  5. The tarball should contain everything required to build the project from source using as much externally released versioned tooling as possible. This is the minimal property lacking today.
    • Corollary: This means including a vendored copy of OpenSSL or libz is not acceptable: link to them as external projects.
    • Open question: How about non-released external tooling such as gnulib or autoconf archive macros? This is a bit more delicate: most distributions either just package one current version of gnulib or autoconf archive, not previous versions. While this could change, and distributions could package the gnulib git repository (up to some current version) and the autoconf archive git repository and packages were set up to extract the version they need (gnulib s ./bootstrap already supports this via the gnulib-refdir parameter), this is not normally in place.
    • Suggested Corollary: The tarball should contain content from git submodule s such as gnulib and the necessary Autoconf archive M4 macros required by the project.
  6. Similar to how the GNU project specify the ./configure interface we need a documented interface for how to bootstrap the project. I suggest to use the already well established idiom of running ./bootstrap to set up the package to later be able to be built via ./configure. Of course, some projects are not using the autotool ./configure interface and will not follow this aspect either, but like most build systems that compete with autotools have instructions on how to build the project, they should document similar interfaces for bootstrapping the source tarball to allow building.
If tarballs that achieve the above goals were available from popular upstream projects, distributions could more easily use them instead of current tarballs that include pre-generated content. The advantage would be that the build process is not tainted by unnecessary files. We need to develop tools for maintainers to create these tarballs, similar to make dist that generate today s foo-1.2.3.tar.gz files. I think one common argument against this approach will be: Why bother with all that, and just use git-archive outputs? Or avoid the entire tarball approach and move directly towards version controlled check outs and referring to upstream releases as git URL and commit tag or id. One problem with this is that SHA-1 is broken, so placing trust in a SHA-1 identifier is simply not secure. Another counter-argument is that this optimize for packagers benefits at the cost of upstream maintainers: most upstream maintainers do not want to store gettext *.po translations in their source code repository. A compromise between the needs of maintainers and packagers is useful, so this *-src.tar.gz tarball approach is the indirection we need to solve that. Update: In my experiment with source-only tarballs for Libntlm I actually did use git-archive output. What do you think?

28 March 2024

Joey Hess: the vulture in the coal mine

Turns out that VPS provider Vultr's terms of service were quietly changed some time ago to give them a "perpetual, irrevocable" license to use content hosted there in any way, including modifying it and commercializing it "for purposes of providing the Services to you." This is very similar to changes that Github made to their TOS in 2017. Since then, Github has been rebranded as "The world s leading AI-powered developer platform". The language in their TOS now clearly lets them use content stored in Github for training AI. (Probably this is their second line of defense if the current attempt to legitimise copyright laundering via generative AI fails.) Vultr is currently in damage control mode, accusing their concerned customers of spreading "conspiracy theories" (-- founder David Aninowsky) and updating the TOS to remove some of the problem language. Although it still allows them to "make derivative works", so could still allow their AI division to scrape VPS images for training data. Vultr claims this was the legalese version of technical debt, that it only ever applied to posts in a forum (not supported by the actual TOS language) and basically that they and their lawyers are incompetant but not malicious. Maybe they are indeed incompetant. But even if I give them the benefit of the doubt, I expect that many other VPS providers, especially ones targeting non-corporate customers, are watching this closely. If Vultr is not significantly harmed by customers jumping ship, if the latest TOS change is accepted as good enough, then other VPS providers will know that they can try this TOS trick too. If Vultr's AI division does well, others will wonder to what extent it is due to having all this juicy training data. For small self-hosters, this seems like a good time to make sure you're using a VPS provider you can actually trust to not be eyeing your disk image and salivating at the thought of stripmining it for decades of emails. Probably also worth thinking about moving to bare metal hardware, perhaps hosted at home. I wonder if this will finally make it worthwhile to mess around with VPS TPMs?

Scarlett Gately Moore: Kubuntu, KDE Report. In Loving Memory of my Son.

Personal: As many of you know, I lost my beloved son March 9th. This has hit me really hard, but I am staying strong and holding on to all the wonderful memories I have. He grew up to be an amazing man, devoted christian and wonderful father. He was loved by everyone who knew him and will be truly missed by us all. I have had folks ask me how they can help. He left behind his 7 year old son Mason. Mason was Billy s world and I would like to make sure Mason is taken care of. I have set up a gofundme for Mason and all proceeds will go to the future care of him. https://gofund.me/25dbff0c

Work report Kubuntu: Bug bashing! I am triaging allthebugs for Plasma which can be seen here: https://bugs.launchpad.net/plasma-5.27/+bug/2053125 I am happy to report many of the remaining bugs have been fixed in the latest bug fix release 5.27.11. I prepared https://kde.org/announcements/plasma/5/5.27.11/ and Rik uploaded to archive, thank you. Unfortunately, this and several other key fixes are stuck in transition do to the time_t64 transition, which you can read about here: https://wiki.debian.org/ReleaseGoals/64bit-time . It is the biggest transition in Debian/Ubuntu history and it couldn t come at a worst time. We are aware our ISO installer is currently broken, calamares is one of those things stuck in this transition. There is a workaround in the comments of the bug report: https://bugs.launchpad.net/ubuntu/+source/calamares/+bug/2054795 Fixed an issue with plasma-welcome. Found the fix for emojis and Aaron has kindly moved this forward with the fontconfig maintainer. Thanks! I have received an https://kfocus.org/spec/spec-ir14.html laptop and it is truly a great machine and is now my daily driver. A big thank you to the Kfocus team! I can t wait to show it off at https://linuxfestnorthwest.org/. KDE Snaps: You will see the activity in this ramp back up as the KDEneon Core project is finally a go! I will participate in the project with part time status and get everyone in the Enokia team up to speed with my snap knowledge, help prepare the qt6/kf6 transition, package plasma, and most importantly I will focus on documentation for future contributors. I have created the ( now split ) qt6 with KDE patchset support and KDE frameworks 6 SDK and runtime snaps. I have made the kde-neon-6 extension and the PR is in: https://github.com/canonical/snapcraft/pull/4698 . Future work on the extension will include multiple versions track support and core24 support.

I have successfully created our first qt6/kf6 snap ark. They will show showing up in the store once all the required bits have been merged and published. Thank you for stopping by. ~Scarlett

26 March 2024

Emmanuel Kasper: Adding a private / custom Certificate Authority to the firefox trust store

Today at $WORK I needed to add the private company Certificate Authority (CA) to Firefox, and I found the steps were unnecessarily complex. Time to blog about that, and I also made a Debian wiki article of that post, so that future generations can update the information, when Firefox 742 is released on Debian 17. The cacert certificate authority is not included in Debian and Firefox, and is thus a good example of adding a private CA. Note that this does not mean I specifically endorse that CA.
  • Test that SSL connections to a site signed by the private CA is failing
$ gnutls-cli wiki.cacert.org:443
...
- Status: The certificate is NOT trusted. The certificate issuer is unknown. 
*** PKI verification of server certificate failed...
*** Fatal error: Error in the certificate.
  • Download the private CA
$ wget http://www.cacert.org/certs/root_X0F.crt
  • test that a connection works with the private CA
$ gnutls-cli --x509cafile root_X0F.crt wiki.cacert.org:443
...
- Status: The certificate is trusted. 
- Description: (TLS1.2-X.509)-(ECDHE-SECP256R1)-(RSA-SHA256)-(AES-256-GCM)
- Session ID: 37:56:7A:89:EA:5F:13:E8:67:E4:07:94:4B:52:23:63:1E:54:31:69:5D:70:17:3C:D0:A4:80:B0:3A:E5:22:B3
- Options: safe renegotiation,
- Handshake was completed
...
  • add the private CA to the Debian trust store located in /etc/ssl/certs/ca-certificates.crt
$ sudo cp root_X0F.crt /usr/local/share/ca-certificates/cacert-org-root-ca.crt
$ sudo update-ca-certificates --verbose
...
Adding debian:cacert-org-root-ca.pem
...
  • verify that we can connect without passing the private CA on the command line
$ gnutls-cli wiki.cacert.org:443
... 
 - Status: The certificate is trusted.
  • At that point most applications are able to connect to systems with a certificate signed by the private CA (curl, Gnome builtin Browser ). However Firefox is using its own trust store and will still display a security error if connecting to https://wiki.cacert.org. To make Firefox trust the Debian trust store, we need to add a so called security device, in fact an extra library wrapping the Debian trust store. The library will wrap the Debian trust store in the PKCS#11 industry format that Firefox supports.
  • install the pkcs#11 wrapping library and command line tools
$ sudo apt install p11-kit p11-kit-modules
  • verify that the private CA is accessible via PKCS#11
$ trust list   grep --context 2 'CA Cert'
pkcs11:id=%16%B5%32%1B%D4%C7%F3%E0%E6%8E%F3%BD%D2%B0%3A%EE%B2%39%18%D1;type=cert
    type: certificate
    label: CA Cert Signing Authority
    trust: anchor
    category: authority
  • now we need to add a new security device in Firefox pointing to the pkcs11 trust store. The pkcs11 trust store is located in /usr/lib/x86_64-linux-gnu/pkcs11/p11-kit-trust.so
$ dpkg --listfiles p11-kit-modules   grep trust
/usr/lib/x86_64-linux-gnu/pkcs11/p11-kit-trust.so
  • in Firefox (tested in version 115 esr), go to Settings -> Privacy & Security -> Security -> Security Devices.
    Then click Load , in the popup window use My local trust as a module name, and /usr/lib/x86_64-linux-gnu/pkcs11/p11-kit-trust.so as a module filename. After adding the module, you should see it in the list of Security Devices, having /etc/ssl/certs/ca-certificates.crt as a description.
  • now restart Firefox and you should be able to browse https://wiki.cacert.org without security errors

21 March 2024

Ravi Dwivedi: Thailand Trip

This post is the second and final part of my Malaysia-Thailand trip. Feel free to check out the Malaysia part here if you haven t already. Kuala Lumpur to Bangkok is around 1500 km by road, and so I took a Malaysian Airlines flight to travel to Bangkok. The flight staff at the Kuala Lumpur only asked me for a return/onward flight and Thailand immigration asked a few questions but did not check any documents (obviously they checked and stamped my passport ;)). The currency of Thailand is the Thai baht, and 1 Thai baht = 2.5 Indian Rupees. The Thailand time is 1.5 hours ahead of Indian time (For example, if it is 12 noon in India, it will be 13:30 in Thailand). I landed in Bangkok at around 3 PM local time. Fletcher was in Bangkok that time, leaving for Pattaya and we had booked the same hostel. So I took a bus to Pattaya from the airport. The next bus for which the tickets were available was at 7 PM, so I took tickets for that one. The bus ticket cost was 143 Thai Baht. I didn t buy SIM at the airport, thinking there must be better deals in the city. As a consequence, there was no way to contact Fletcher through internet. Although I had a few minutes call remaining out of my international roaming pack.
A welcome sign at Bangkok's Suvarnabhumi airport.
Bus from Suvarnabhumi Airport to Jomtien Beach in Pattaya.
Our accommodation was near Jomtien beach, so I got off at the last stop, as the bus terminates at the Jomtien beach. Then I decided to walk towards my accommodation. I was using OsmAnd for navigation. However, the place was not marked on OpenStreetMap, and it turned out I missed the street my hostel was on and walked around 1 km further as I was chasing a similarly named incorrect hostel on OpenStreetMap. Then I asked for help from two men sitting at a caf . One of them said he will help me find the street my hostel is on. So, I walked with him, and he told me he lives in Thailand for many years, but he is from Kuwait. He also gave me valuable information. Like, he told me about shared hail-and-ride songthaews which run along the Jomtien Second Road and charge 10 Baht for any distance on their route. This tip significantly reduced our expenses. Further, he suggested me 7-Eleven shops for buying a local SIM. Like Malaysia, Thailand has 24/7 7-Eleven convenience stores, a lot of them not even 100 m apart. The Kuwaiti person dropped me at the address where my hostel was. I tried searching for a person in-charge of that hostel, and soon I realized there was no reception. After asking for help from locals for some time, I bumped into Fletcher, who also came to this address and was searching for the same. After finding a friend, I felt a sigh of relief. Adjacent to the property, there was a hairdresser shop. We went there and asked about this property. The woman called the owner, and she also told us the required passcodes to go inside. Our accommodation was in a room on the second floor, which required us to put a passcode for opening. We entered the passcode and entered the room. So, we stayed at this hostel which had no reception. Due to this, it took 2 hours to find our room and enter. It reminded me of a difficult experience I had in Albania, where me and Akshat were not able to find our apartment in one of the hottest days and the owner didn t know our language. Traveling from the place where the bus dropped me to the hostel, I saw streets were filled with bars and massage parlors, which was expected. Prostitutes were everywhere. We went out at night towards the beach and also roamed around in 7-Elevens to buy a SIM card for myself. I got a SIM for 7 day unlimited internet for 399 baht. Turns out that the rates of SIM cards at the airport were not so different from inside the city.
Road near Jomtien beach in Pattaya
Photo of a songthaew in Pattaya. There are shared songthaews which run along Jomtien Second road and takes 10 bath to anywhere on the route.
Jomtien Beach in Pattaya.
In terms of speaking English, locals didn t know English at all in both Pattaya and Bangkok. I normally don t expect locals to know English in a non-English speaking country, but the fact that Bangkok is one of the most visited places by tourists made me expect locals to know some English. Talking to locals is an integral part of travel for me, which I couldn t do a lot in Thailand. This aspect is much more important for me than going to touristy places. So, we were in Pattaya. Next morning, Fletcher and I went to Tiger park using shared songthaew. After that, we planned to visit Pattaya Floating market which is near the Tiger Park, but we felt the ticket prices were higher than it was worth. Fletcher had to leave for Bangkok on that day. I suggested him to go to Suvarnabhumi Airport from the Jomtien beach bus terminal (this was the route I took the last day in opposite direction) to avoid traffic congestion inside Bangkok, as he can follow up with metro once he reaches the airport. From the floating market, we were walking in sweltering heat to reach the Jomtien beach. I tried asking for a lift and eventually got successful as a scooty stopped, and surprisingly the person gave a ride to both of us. He was from Delhi, so maybe that s the reason he stopped for us. Then we took a songthaew to the bus terminal and after having lunch, Fletcher left for Bangkok.
A welcome sign at Pattaya Floating market.
This Korean Vegetasty noodles pack was yummy and was available at many 7-Eleven stores.
Next day I went to Bangkok, but Fletcher already left for Kuala Lumpur. Here I had booked a private room in a hotel (instead of a hostel) for four nights, mainly because of my luggage. This costed 5600 INR for four nights. It was 2 km from the metro station, which I used to walk both sides. In Bangkok, I visited Sukhumvit and Siam by metro. Going to some areas require crossing the Chao Phraya river. For this, I took Chao Phraya Express Boat for going to places like Khao San road and Wat Arun. I would recommend taking the boat ride as it had very good views. In Bangkok, I met a person from Pakistan staying in my hotel and so here also I got some company. But by the time I met him, my days were almost over. So, we went to a random restaurant selling Indian food where we ate some paneer dish with naan and that restaurant person was from Myanmar.
Wat Arun temple stamps your hand upon entry
Wat Arun temple
Khao San Road
A food stall at Khao San Road
Chao Phraya Express Boat
For eating, I mainly relied on fruits and convenience stores. Bananas were very tasty. This was the first time I saw banana flesh being yellow. Mangoes were delicious and pineapples were smaller and flavorful. I also ate Rose Apple, which I never had before. I had Chhole Kulche once in Sukhumvit. That was a little expensive as it costed 164 baht. I also used to buy premix coffee packets from 7-Eleven convenience stores and prepare them inside the stores.
Banana with yellow flesh
Fruits at a stall in Bangkok
Trimmed pineapples from Thailand.
Corn in Bangkok.
A board showing coffee menu at a 7-Eleven store along with rates in Pattaya.
In this section of 7-Eleven, you can buy a premix coffee and mix it with hot water provided at the store to prepare.
My booking from Bangkok to Delhi was in Air India flight, and they were serving alcohol in the flight. I chose red wine, and this was my first time having alcohol in a flight.
Red wine being served in Air India

Notes
  • In this whole trip spanning two weeks, I did not pay for drinking water (except for once in Pattaya which was 9 baht) and toilets. Bangkok and Kuala Lumpur have plenty of malls where you should find a free-of-cost toilet nearby. For drinking water, I relied mainly on my accommodation providing refillable water for my bottle.
  • Thailand seemed more expensive than Malaysia on average. Malaysia had discounted price due to the Chinese New year.
  • I liked Pattaya more than Bangkok. Maybe because Pattaya has beach and Bangkok doesn t. Pattaya seemed more lively, and I could meet and talk to a few people as opposed to Bangkok.
  • Chao Phraya River express boat costs 150 baht for one day where you can hop on and off to any boat.

18 March 2024

Simon Josefsson: Apt archive mirrors in Git-LFS

My effort to improve transparency and confidence of public apt archives continues. I started to work on this in Apt Archive Transparency in which I mention the debdistget project in passing. Debdistget is responsible for mirroring index files for some public apt archives. I ve realized that having a publicly auditable and preserved mirror of the apt repositories is central to being able to do apt transparency work, so the debdistget project has become more central to my project than I thought. Currently I track Trisquel, PureOS, Gnuinos and their upstreams Ubuntu, Debian and Devuan. Debdistget download Release/Package/Sources files and store them in a git repository published on GitLab. Due to size constraints, it uses two repositories: one for the Release/InRelease files (which are small) and one that also include the Package/Sources files (which are large). See for example the repository for Trisquel release files and the Trisquel package/sources files. Repositories for all distributions can be found in debdistutils archives GitLab sub-group. The reason for splitting into two repositories was that the git repository for the combined files become large, and that some of my use-cases only needed the release files. Currently the repositories with packages (which contain a couple of months worth of data now) are 9GB for Ubuntu, 2.5GB for Trisquel/Debian/PureOS, 970MB for Devuan and 450MB for Gnuinos. The repository size is correlated to the size of the archive (for the initial import) plus the frequency and size of updates. Ubuntu s use of Apt Phased Updates (which triggers a higher churn of Packages file modifications) appears to be the primary reason for its larger size. Working with large Git repositories is inefficient and the GitLab CI/CD jobs generate quite some network traffic downloading the git repository over and over again. The most heavy user is the debdistdiff project that download all distribution package repositories to do diff operations on the package lists between distributions. The daily job takes around 80 minutes to run, with the majority of time is spent on downloading the archives. Yes I know I could look into runner-side caching but I dislike complexity caused by caching. Fortunately not all use-cases requires the package files. The debdistcanary project only needs the Release/InRelease files, in order to commit signatures to the Sigstore and Sigsum transparency logs. These jobs still run fairly quickly, but watching the repository size growth worries me. Currently these repositories are at Debian 440MB, PureOS 130MB, Ubuntu/Devuan 90MB, Trisquel 12MB, Gnuinos 2MB. Here I believe the main size correlation is update frequency, and Debian is large because I track the volatile unstable. So I hit a scalability end with my first approach. A couple of months ago I solved this by discarding and resetting these archival repositories. The GitLab CI/CD jobs were fast again and all was well. However this meant discarding precious historic information. A couple of days ago I was reaching the limits of practicality again, and started to explore ways to fix this. I like having data stored in git (it allows easy integration with software integrity tools such as GnuPG and Sigstore, and the git log provides a kind of temporal ordering of data), so it felt like giving up on nice properties to use a traditional database with on-disk approach. So I started to learn about Git-LFS and understanding that it was able to handle multi-GB worth of data that looked promising. Fairly quickly I scripted up a GitLab CI/CD job that incrementally update the Release/Package/Sources files in a git repository that uses Git-LFS to store all the files. The repository size is now at Ubuntu 650kb, Debian 300kb, Trisquel 50kb, Devuan 250kb, PureOS 172kb and Gnuinos 17kb. As can be expected, jobs are quick to clone the git archives: debdistdiff pipelines went from a run-time of 80 minutes down to 10 minutes which more reasonable correlate with the archive size and CPU run-time. The LFS storage size for those repositories are at Ubuntu 15GB, Debian 8GB, Trisquel 1.7GB, Devuan 1.1GB, PureOS/Gnuinos 420MB. This is for a couple of days worth of data. It seems native Git is better at compressing/deduplicating data than Git-LFS is: the combined size for Ubuntu is already 15GB for a couple of days data compared to 8GB for a couple of months worth of data with pure Git. This may be a sub-optimal implementation of Git-LFS in GitLab but it does worry me that this new approach will be difficult to scale too. At some level the difference is understandable, Git-LFS probably store two different Packages files around 90MB each for Trisquel as two 90MB files, but native Git would store it as one compressed version of the 90MB file and one relatively small patch to turn the old files into the next file. So the Git-LFS approach surprisingly scale less well for overall storage-size. Still, the original repository is much smaller, and you usually don t have to pull all LFS files anyway. So it is net win. Throughout this work, I kept thinking about how my approach relates to Debian s snapshot service. Ultimately what I would want is a combination of these two services. To have a good foundation to do transparency work I would want to have a collection of all Release/Packages/Sources files ever published, and ultimately also the source code and binaries. While it makes sense to start on the latest stable releases of distributions, this effort should scale backwards in time as well. For reproducing binaries from source code, I need to be able to securely find earlier versions of binary packages used for rebuilds. So I need to import all the Release/Packages/Sources packages from snapshot into my repositories. The latency to retrieve files from that server is slow so I haven t been able to find an efficient/parallelized way to download the files. If I m able to finish this, I would have confidence that my new Git-LFS based approach to store these files will scale over many years to come. This remains to be seen. Perhaps the repository has to be split up per release or per architecture or similar. Another factor is storage costs. While the git repository size for a Git-LFS based repository with files from several years may be possible to sustain, the Git-LFS storage size surely won t be. It seems GitLab charges the same for files in repositories and in Git-LFS, and it is around $500 per 100GB per year. It may be possible to setup a separate Git-LFS backend not hosted at GitLab to serve the LFS files. Does anyone know of a suitable server implementation for this? I had a quick look at the Git-LFS implementation list and it seems the closest reasonable approach would be to setup the Gitea-clone Forgejo as a self-hosted server. Perhaps a cloud storage approach a la S3 is the way to go? The cost to host this on GitLab will be manageable for up to ~1TB ($5000/year) but scaling it to storing say 500TB of data would mean an yearly fee of $2.5M which seems like poor value for the money. I realized that ultimately I would want a git repository locally with the entire content of all apt archives, including their binary and source packages, ever published. The storage requirements for a service like snapshot (~300TB of data?) is today not prohibitly expensive: 20TB disks are $500 a piece, so a storage enclosure with 36 disks would be around $18.000 for 720TB and using RAID1 means 360TB which is a good start. While I have heard about ~TB-sized Git-LFS repositories, would Git-LFS scale to 1PB? Perhaps the size of a git repository with multi-millions number of Git-LFS pointer files will become unmanageable? To get started on this approach, I decided to import a mirror of Debian s bookworm for amd64 into a Git-LFS repository. That is around 175GB so reasonable cheap to host even on GitLab ($1000/year for 200GB). Having this repository publicly available will make it possible to write software that uses this approach (e.g., porting debdistreproduce), to find out if this is useful and if it could scale. Distributing the apt repository via Git-LFS would also enable other interesting ideas to protecting the data. Consider configuring apt to use a local file:// URL to this git repository, and verifying the git checkout using some method similar to Guix s approach to trusting git content or Sigstore s gitsign. A naive push of the 175GB archive in a single git commit ran into pack size limitations: remote: fatal: pack exceeds maximum allowed size (4.88 GiB) however breaking up the commit into smaller commits for parts of the archive made it possible to push the entire archive. Here are the commands to create this repository: git init
git lfs install
git lfs track 'dists/**' 'pool/**'
git add .gitattributes
git commit -m"Add Git-LFS track attributes." .gitattributes
time debmirror --method=rsync --host ftp.se.debian.org --root :debian --arch=amd64 --source --dist=bookworm,bookworm-updates --section=main --verbose --diff=none --keyring /usr/share/keyrings/debian-archive-keyring.gpg --ignore .git .
git add dists project
git commit -m"Add." -a
git remote add origin git@gitlab.com:debdistutils/archives/debian/mirror.git
git push --set-upstream origin --all
for d in pool//; do
echo $d;
time git add $d;
git commit -m"Add $d." -a
git push
done
The resulting repository size is around 27MB with Git LFS object storage around 174GB. I think this approach would scale to handle all architectures for one release, but working with a single git repository for all releases for all architectures may lead to a too large git repository (>1GB). So maybe one repository per release? These repositories could also be split up on a subset of pool/ files, or there could be one repository per release per architecture or sources. Finally, I have concerns about using SHA1 for identifying objects. It seems both Git and Debian s snapshot service is currently using SHA1. For Git there is SHA-256 transition and it seems GitLab is working on support for SHA256-based repositories. For serious long-term deployment of these concepts, it would be nice to go for SHA256 identifiers directly. Git-LFS already uses SHA256 but Git internally uses SHA1 as does the Debian snapshot service. What do you think? Happy Hacking!

Next.