Search Results: "mj"

20 September 2022

Matthew Garrett: Handling WebAuthn over remote SSH connections

Being able to SSH into remote machines and do work there is great. Using hardware security tokens for 2FA is also great. But trying to use them both at the same time doesn't work super well, because if you hit a WebAuthn request on the remote machine it doesn't matter how much you mash your token - it's not going to work.

But could it?

The SSH agent protocol abstracts key management out of SSH itself and into a separate process. When you run "ssh-add .ssh/id_rsa", that key is being loaded into the SSH agent. When SSH wants to use that key to authenticate to a remote system, it asks the SSH agent to perform the cryptographic signatures on its behalf. SSH also supports forwarding the SSH agent protocol over SSH itself, so if you SSH into a remote system then remote clients can also access your keys - this allows you to bounce through one remote system into another without having to copy your keys to those remote systems.

More recently, SSH gained the ability to store SSH keys on hardware tokens such as Yubikeys. If configured appropriately, this means that even if you forward your agent to a remote site, that site can't do anything with your keys unless you physically touch the token. But out of the box, this is only useful for SSH keys - you can't do anything else with this support.

Well, that's what I thought, at least. And then I looked at the code and realised that SSH is communicating with the security tokens using the same library that a browser would, except it ensures that any signature request starts with the string "ssh:" (which a genuine WebAuthn request never will). This constraint can actually be disabled by passing -O no-restrict-websafe to ssh-agent, except that was broken until this weekend. But let's assume there's a glorious future where that patch gets backported everywhere, and see what we can do with it.

First we need to load the key into the security token. For this I ended up hacking up the Go SSH agent support. Annoyingly it doesn't seem to be possible to make calls to the agent without going via one of the exported methods here, so I don't think this logic can be implemented without modifying the agent module itself. But this is basically as simple as adding another key message type that looks something like:
type ecdsaSkKeyMsg struct  
       Type        string  sshtype:"17 25" 
       Curve       string
       PubKeyBytes []byte
       RpId        string
       Flags       uint8
       KeyHandle   []byte
       Reserved    []byte
       Comments    string
       Constraints []byte  ssh:"rest" 
 
Where Type is ssh.KeyAlgoSKECDSA256, Curve is "nistp256", RpId is the identity of the relying party (eg, "webauthn.io"), Flags is 0x1 if you want the user to have to touch the key, KeyHandle is the hardware token's representation of the key (basically an opaque blob that's sufficient for the token to regenerate the keypair - this is generally stored by the remote site and handed back to you when it wants you to authenticate). The other fields can be ignored, other than PubKeyBytes, which is supposed to be the public half of the keypair.

This causes an obvious problem. We have an opaque blob that represents a keypair. We don't have the public key. And OpenSSH verifies that PubKeyByes is a legitimate ecdsa public key before it'll load the key. Fortunately it only verifies that it's a legitimate ecdsa public key, and does nothing to verify that it's related to the private key in any way. So, just generate a new ECDSA key (ecdsa.GenerateKey(elliptic.P256(), rand.Reader)) and marshal it ( elliptic.Marshal(ecKey.Curve, ecKey.X, ecKey.Y)) and we're good. Pass that struct to ssh.Marshal() and then make an agent call.

Now you can use the standard agent interfaces to trigger a signature event. You want to pass the raw challenge (not the hash of the challenge!) - the SSH code will do the hashing itself. If you're using agent forwarding this will be forwarded from the remote system to your local one, and your security token should start blinking - touch it and you'll get back an ssh.Signature blob. ssh.Unmarshal() the Blob member to a struct like
type ecSig struct  
        R *big.Int
        S *big.Int
 
and then ssh.Unmarshal the Rest member to
type authData struct  
        Flags    uint8
        SigCount uint32
 
The signature needs to be converted back to a DER-encoded ASN.1 structure (eg,
var b cryptobyte.Builder
b.AddASN1(asn1.SEQUENCE, func(b *cryptobyte.Builder)  
        b.AddASN1BigInt(ecSig.R)
        b.AddASN1BigInt(ecSig.S)
 )
signatureDER, _ := b.Bytes()
, and then you need to construct the Authenticator Data structure. For this, take the RpId used earlier and generate the sha256. Append the one byte Flags variable, and then convert SigCount to big endian and append those 4 bytes. You should now have a 37 byte structure. This needs to be CBOR encoded (I used github.com/fxamacker/cbor and just called cbor.Marshal(data, cbor.EncOptions )).

Now base64 encode the sha256 of the challenge data, the DER-encoded signature and the CBOR-encoded authenticator data and you've got everything you need to provide to the remote site to satisfy the challenge.

There are alternative approaches - you can use USB/IP to forward the hardware token directly to the remote system. But that means you can't use it locally, so it's less than ideal. Or you could implement a proxy that communicates with the key locally and have that tunneled through to the remote host, but at that point you're just reinventing ssh-agent.

And you should bear in mind that the default behaviour of blocking this sort of request is for a good reason! If someone is able to compromise a remote system that you're SSHed into, they can potentially trick you into hitting the key to sign a request they've made on behalf of an arbitrary site. Obviously they could do the same without any of this if they've compromised your local system, but there is some additional risk to this. It would be nice to have sensible MAC policies that default-denied access to the SSH agent socket and only allowed trustworthy binaries to do so, or maybe have some sort of reasonable flatpak-style portal to gate access. For my threat model I think it's a worthwhile security tradeoff, but you should evaluate that carefully yourself.

Anyway. Now to figure out whether there's a reasonable way to get browsers to work with this.

comment count unavailable comments

19 September 2022

Matthew Garrett: Bring Your Own Disaster

After my last post, someone suggested that having employers be able to restrict keys to machines they control is a bad thing. So here's why I think Bring Your Own Device (BYOD) scenarios are bad not only for employers, but also for users.

There's obvious mutual appeal to having developers use their own hardware rather than rely on employer-provided hardware. The user gets to use hardware they're familiar with, and which matches their ergonomic desires. The employer gets to save on the money required to buy new hardware for the employee. From this perspective, there's a clear win-win outcome.

But once you start thinking about security, it gets more complicated. If I, as an employer, want to ensure that any systems that can access my resources meet a certain security baseline (eg, I don't want my developers using unpatched Windows ME), I need some of my own software installed on there. And that software doesn't magically go away when the user is doing their own thing. If a user lends their machine to their partner, is the partner fully informed about what level of access I have? Are they going to feel that their privacy has been violated if they find out afterwards?

But it's not just about monitoring. If an employee's machine is compromised and the compromise is detected, what happens next? If the employer owns the system then it's easy - you pick up the device for forensic analysis and give the employee a new machine to use while that's going on. If the employee owns the system, they're probably not going to be super enthusiastic about handing over a machine that also contains a bunch of their personal data. In much of the world the law is probably on their side, and even if it isn't then telling the employee that they have a choice between handing over their laptop or getting fired probably isn't going to end well.

But obviously this is all predicated on the idea that an employer needs visibility into what's happening on systems that have access to their systems, or which are used to develop code that they'll be deploying. And I think it's fair to say that not everyone needs that! But if you hold any sort of personal data (including passwords) for any external users, I really do think you need to protect against compromised employee machines, and that does mean having some degree of insight into what's happening on those machines. If you don't want to deal with the complicated consequences of allowing employees to use their own hardware, it's rational to ensure that only employer-owned hardware can be used.

But what about the employers that don't currently need that? If there's no plausible future where you'll host user data, or where you'll sell products to others who'll host user data, then sure! But if that might happen in future (even if it doesn't right now), what's your transition plan? How are you going to deal with employees who are happily using their personal systems right now? At what point are you going to buy new laptops for everyone? BYOD might work for you now, but will it always?

And if your employer insists on employees using their own hardware, those employees should ask what happens in the event of a security breach. Whose responsibility is it to ensure that hardware is kept up to date? Is there an expectation that security can insist on the hardware being handed over for investigation? What information about the employee's use of their own hardware is going to be logged, who has access to those logs, and how long are those logs going to be kept for? If those questions can't be answered in a reasonable way, it's a huge red flag. You shouldn't have to give up your privacy and (potentially) your hardware for a job.

Using technical mechanisms to ensure that employees only use employer-provided hardware is understandably icky, but it's something that allows employers to impose appropriate security policies without violating employee privacy.

comment count unavailable comments

15 September 2022

Matthew Garrett: git signatures with SSH certificates

Last night I complained that git's SSH signature format didn't support using SSH certificates rather than raw keys, and was swiftly corrected, once again highlighting that the best way to make something happen is to complain about it on the internet in order to trigger the universe to retcon it into existence to make you look like a fool. But anyway. Let's talk about making this work!

git's SSH signing support is actually just it shelling out to ssh-keygen with a specific set of options, so let's go through an example of this with ssh-keygen. First, here's my certificate:

$ ssh-keygen -L -f id_aurora-cert.pub
id_aurora-cert.pub:
Type: ecdsa-sha2-nistp256-cert-v01@openssh.com user certificate
Public key: ECDSA-CERT SHA256:(elided)
Signing CA: RSA SHA256:(elided)
Key ID: "mgarrett@aurora.tech"
Serial: 10505979558050566331
Valid: from 2022-09-13T17:23:53 to 2022-09-14T13:24:23
Principals:
mgarrett@aurora.tech
Critical Options: (none)
Extensions:
permit-agent-forwarding
permit-port-forwarding
permit-pty

Ok! Now let's sign something:

$ ssh-keygen -Y sign -f ~/.ssh/id_aurora-cert.pub -n git /tmp/testfile
Signing file /tmp/testfile
Write signature to /tmp/testfile.sig

To verify this we need an allowed signatures file, which should look something like:

*@aurora.tech cert-authority ssh-rsa AAA(elided)

Perfect. Let's verify it:

$ cat /tmp/testfile ssh-keygen -Y verify -f /tmp/allowed_signers -I mgarrett@aurora.tech -n git -s /tmp/testfile.sig
Good "git" signature for mgarrett@aurora.tech with ECDSA-CERT key SHA256:(elided)


Woo! So, how do we make use of this in git? Generating the signatures is as simple as

$ git config --global commit.gpgsign true
$ git config --global gpg.format ssh
$ git config --global user.signingkey /home/mjg59/.ssh/id_aurora-cert.pub


and then getting on with life. Any commits will now be signed with the provided certificate. Unfortunately, git itself won't handle verification of these - it calls ssh-keygen -Y find-principals which doesn't deal with wildcards in the allowed signers file correctly, and then falls back to verifying the signature without making any assertions about identity. Which means you're going to have to implement this in your own CI by extracting the commit and the signature, extracting the identity from the commit metadata and calling ssh-keygen on your own. But it can be made to work!

But why would you want to? The current approach of managing keys for git isn't ideal - you can kind of piggy-back off github/gitlab SSH key infrastructure, but if you're an enterprise using SSH certificates for access then your users don't necessarily have enrolled keys to start with. And using certificates gives you extra benefits, such as having your CA verify that keys are hardware-backed before issuing a cert. Want to ensure that whoever made a commit was actually on an authorised laptop? Now you can!

I'll probably spend a little while looking into whether it's plausible to make the git verification code work with certificates or whether the right thing is to fix up ssh-keygen -Y find-principals to work with wildcard identities, but either way it's probably not much effort to get this working out of the box.

Edit to add: thanks to this commenter for pointing out that current OpenSSH git actually makes this work already!

comment count unavailable comments

1 August 2022

Sergio Talens-Oliag: Using Git Server Hooks on GitLab CE to Validate Tags

Since a long time ago I ve been a gitlab-ce user, in fact I ve set it up on three of the last four companies I ve worked for (initially I installed it using the omnibus packages on a debian server but on the last two places I moved to the docker based installation, as it is easy to maintain and we don t need a big installation as the teams using it are small). On the company I work for now (kyso) we are using it to host all our internal repositories and to do all the CI/CD work (the automatic deployments are triggered by web hooks in some cases, but the rest is all done using gitlab-ci). The majority of projects are using nodejs as programming language and we have automated the publication of npm packages on our gitlab instance npm registry and even the publication into the npmjs registry. To publish the packages we have added rules to the gitlab-ci configuration of the relevant repositories and we publish them when a tag is created. As the we are lazy by definition, I configured the system to use the tag as the package version; I tested if the contents of the package.json where in sync with the expected version and if it was not I updated it and did a force push of the tag with the updated file using the following code on the script that publishes the package:
# Update package version & add it to the .build-args
INITIAL_PACKAGE_VERSION="$(npm pkg get version tr -d '"')"
npm version --allow-same --no-commit-hooks --no-git-tag-version \
  "$CI_COMMIT_TAG"
UPDATED_PACKAGE_VERSION="$(npm pkg get version tr -d '"')"
echo "UPDATED_PACKAGE_VERSION=$UPDATED_PACKAGE_VERSION" >> .build-args
# Update tag if the version was updated or abort
if [ "$INITIAL_PACKAGE_VERSION" != "$UPDATED_PACKAGE_VERSION" ]; then
  if [ -n "$CI_GIT_USER" ] && [ -n "$CI_GIT_TOKEN" ]; then
    git commit -m "Updated version from tag $CI_COMMIT_TAG" package.json
    git tag -f "$CI_COMMIT_TAG" -m "Updated version from tag"
    git push -f -o ci.skip origin "$CI_COMMIT_TAG"
  else
    echo "!!! ERROR !!!"
    echo "The updated tag could not be uploaded."
    echo "Set CI_GIT_USER and CI_GIT_TOKEN or fix the 'package.json' file"
    echo "!!! ERROR !!!"
    exit 1
  fi
fi
This feels a little dirty (we are leaving commits on the tag but not updating the original branch); I thought about trying to find the branch using the tag and update it, but I drop the idea pretty soon as there were multiple issues to consider (i.e. we can have tags pointing to commits present in multiple branches and even if it only points to one the tag does not have to be the HEAD of the branch making the inclusion difficult). In any case this system was working, so we left it until we started to publish to the NPM Registry; as we are using a token to push the packages that we don t want all developers to have access to (right now it would not matter, but when the team grows it will) I started to use gitlab protected branches on the projects that need it and adjusting the .npmrc file using protected variables. The problem then was that we can no longer do a standard force push for a branch (that is the main point of the protected branches feature) unless we use the gitlab api, so the tags with the wrong version started to fail. As the way things were being done seemed dirty anyway I thought that the best way of fixing things was to forbid users to push a tag that includes a version that does not match the package.json version. After thinking about it we decided to use githooks on the gitlab server for the repositories that need it, as we are only interested in tags we are going to use the update hook; it is executed once for each ref to be updated, and takes three parameters:
  • the name of the ref being updated,
  • the old object name stored in the ref,
  • and the new object name to be stored in the ref.
To install our hook we have found the gitaly relative path of each repo and located it on the server filesystem (as I said we are using docker and the gitlab s data directory is on /srv/gitlab/data, so the path to the repo has the form /srv/gitlab/data/git-data/repositories/@hashed/xx/yy/hash.git). Once we have the directory we need to:
  • create a custom_hooks sub directory inside it,
  • add the update script (as we only need one script we used that instead of creating an update.d directory, the good thing is that this will also work with a standard git server renaming the base directory to hooks instead of custom_hooks),
  • make it executable, and
  • change the directory and file ownership to make sure it can be read and executed from the gitlab container
On a console session:
$ cd /srv/gitlab/data/git-data/repositories/@hashed/xx/yy/hash.git
$ mkdir custom_hooks
$ edit_or_copy custom_hooks/update
$ chmod 0755 custom_hooks/update
$ chown --reference=. -R custom_hooks
The update script we are using is as follows:
#!/bin/sh
set -e
# kyso update hook
#
# Right now it checks version.txt or package.json versions against the tag name
# (it supports a 'v' prefix on the tag)
# Arguments
ref_name="$1"
old_rev="$2"
new_rev="$3"
# Initial test
if [ -z "$ref_name" ]    [ -z "$old_rev" ]   [ -z "$new_rev" ]; then
  echo "usage: $0 <ref> <oldrev> <newrev>" >&2
  exit 1
fi
# Get the tag short name
tag_name="$ ref_name##refs/tags/ "
# Exit if the update is not for a tag
if [ "$tag_name" = "$ref_name" ]; then
  exit 0
fi
# Get the null rev value (string of zeros)
zero=$(git hash-object --stdin </dev/null   tr '0-9a-f' '0')
# Get if the tag is new or not
if [ "$old_rev" = "$zero" ]; then
  new_tag="true"
else
  new_tag="false"
fi
# Get the type of revision:
# - delete: if the new_rev is zero
# - commit: annotated tag
# - tag: un-annotated tag
if [ "$new_rev" = "$zero" ]; then
  new_rev_type="delete"
else
  new_rev_type="$(git cat-file -t "$new_rev")"
fi
# Exit if we are deleting a tag (nothing to check here)
if [ "$new_rev_type" = "delete" ]; then
  exit 0
fi
# Check the version against the tag (supports version.txt & package.json)
if git cat-file -e "$new_rev:version.txt" >/dev/null 2>&1; then
  version="$(git cat-file -p "$new_rev:version.txt")"
  if [ "$version" = "$tag_name" ]   [ "$version" = "$ tag_name#v " ]; then
    exit 0
  else
    EMSG="tag '$tag_name' and 'version.txt' contents '$version' don't match"
    echo "GL-HOOK-ERR: $EMSG"
    exit 1
  fi
elif git cat-file -e "$new_rev:package.json" >/dev/null 2>&1; then
  version="$(
    git cat-file -p "$new_rev:package.json"   jsonpath version   tr -d '\[\]"'
  )"
  if [ "$version" = "$tag_name" ]   [ "$version" = "$ tag_name#v " ]; then
    exit 0
  else
    EMSG="tag '$tag_name' and 'package.json' version '$version' don't match"
    echo "GL-HOOK-ERR: $EMSG"
    exit 1
  fi
else
  # No version.txt or package.json file found
  exit 0
fi
Some comments about it:
  • we are only looking for tags, if the ref_name does not have the prefix refs/tags/ the script does an exit 0,
  • although we are checking if the tag is new or not we are not using the value (in gitlab that is handled by the protected tag feature),
  • if we are deleting a tag the script does an exit 0, we don t need to check anything in that case,
  • we are ignoring if the tag is annotated or not (we set the new_rev_type to tag or commit, but we don t use the value),
  • we test first the version.txt file and if it does not exist we check the package.json file, if it does not exist either we do an exit 0, as there is no version to check against and we allow that on a tag,
  • we add the GL-HOOK-ERR: prefix to the messages to show them on the gitlab web interface (can be tested creating a tag from it),
  • to get the version on the package.json file we use the jsonpath binary (it is installed by the jsonpath ruby gem) because it is available on the gitlab container (initially I used sed to get the value, but a real JSON parser is always a better option).
Once the hook is installed when a user tries to push a tag to a repository that has a version.txt file or package.json file and the tag does not match the version (if version.txt is present it takes precedence) the push fails. If the tag matches or the files are not present the tag is added if the user has permission to add it in gitlab (our hook is only executed if the user is allowed to create or update the tag).

28 July 2022

Matthew Garrett: UEFI rootkits and UEFI secure boot

Kaspersky describes a UEFI-implant used to attack Windows systems. Based on it appearing to require patching of the system firmware image, they hypothesise that it's propagated by manually dumping the contents of the system flash, modifying it, and then reflashing it back to the board. This probably requires physical access to the board, so it's not especially terrifying - if you're in a situation where someone's sufficiently enthusiastic about targeting you that they're reflashing your computer by hand, it's likely that you're going to have a bad time regardless.

But let's think about why this is in the firmware at all. Sophos previously discussed an implant that's sufficiently similar in some technical details that Kaspersky suggest they may be related to some degree. One notable difference is that the MyKings implant described by Sophos installs itself into the boot block of legacy MBR partitioned disks. This code will only be executed on old-style BIOS systems (or UEFI systems booting in BIOS compatibility mode), and they have no support for code signatures, so there's no need to be especially clever. Run malicious code in the boot block, patch the next stage loader, follow that chain all the way up to the kernel. Simple.

One notable distinction here is that the MBR boot block approach won't be persistent - if you reinstall the OS, the MBR will be rewritten[1] and the infection is gone. UEFI doesn't really change much here - if you reinstall Windows a new copy of the bootloader will be written out and the UEFI boot variables (that tell the firmware which bootloader to execute) will be updated to point at that. The implant may still be on disk somewhere, but it won't be run.

But there's a way to avoid this. UEFI supports loading firmware-level drivers from disk. If, rather than providing a backdoored bootloader, the implant takes the form of a UEFI driver, the attacker can set a different set of variables that tell the firmware to load that driver at boot time, before running the bootloader. OS reinstalls won't modify these variables, which means the implant will survive and can reinfect the new OS install. The only way to get rid of the implant is to either reformat the drive entirely (which most OS installers won't do by default) or replace the drive before installation.

This is much easier than patching the system firmware, and achieves similar outcomes - the number of infected users who are going to wipe their drives to reinstall is fairly low, and the kernel could be patched to hide the presence of the implant on the filesystem[2]. It's possible that the goal was to make identification as hard as possible, but there's a simpler argument here - if the firmware has UEFI Secure Boot enabled, the firmware will refuse to load such a driver, and the implant won't work. You could certainly just patch the firmware to disable secure boot and lie about it, but if you're at the point of patching the firmware anyway you may as well just do the extra work of installing your implant there.

I think there's a reasonable argument that the existence of firmware-level rootkits suggests that UEFI Secure Boot is doing its job and is pushing attackers into lower levels of the stack in order to obtain the same outcomes. Technologies like Intel's Boot Guard may (in their current form) tend to block user choice, but in theory should be effective in blocking attacks of this form and making things even harder for attackers. It should already be impossible to perform attacks like the one Kaspersky describes on more modern hardware (the system should identify that the firmware has been tampered with and fail to boot), which pushes things even further - attackers will have to take advantage of vulnerabilities in the specific firmware they're targeting. This obviously means there's an incentive to find more firmware vulnerabilities, which means the ability to apply security updates for system firmware as easily as security updates for OS components is vital (hint hint if your system firmware updates aren't available via LVFS you're probably doing it wrong).

We've known that UEFI rootkits have existed for a while (Hacking Team had one in 2015), but it's interesting to see a fairly widespread one out in the wild. Protecting against this kind of attack involves securing the entire boot chain, including the firmware itself. The industry has clearly been making progress in this respect, and it'll be interesting to see whether such attacks become more common (because Secure Boot works but firmware security is bad) or not.

[1] As we all remember from Windows installs overwriting Linux bootloaders
[2] Although this does run the risk of an infected user booting another OS instead, and being able to see the implant

comment count unavailable comments

26 July 2022

Rapha&#235;l Hertzog: Freexian s report about Debian Long Term Support, June 2022

A Debian LTS logo
Like each month, have a look at the work funded by Freexian s Debian LTS offering. Debian project funding No any major updates on running projects.
Two 1, 2 projects are in the pipeline now.
Tryton project is in a review phase. Gradle projects is still fighting in work. In June, we put aside 2254 EUR to fund Debian projects. We re looking forward to receive more projects from various Debian teams! Learn more about the rationale behind this initiative in this article. Debian LTS contributors In June, 15 contributors have been paid to work on Debian LTS, their reports are available: Evolution of the situation In June we released 27 DLAs.

This is a special month, where we have two releases (stretch and jessie) as ELTS and NO release as LTS. Buster is still handled by the security team and will probably be given in LTS hands at the beginning of the August. During this month we are updating the infrastructure, documentation and improve our internal processes to switch to a new release.
Many developers have just returned back from Debconf22, hold in Prizren, Kosovo! Many (E)LTS members could meet face-to-face and discuss some technical and social topics! Also LTS BoF took place, where the project was introduced (link to video).
Thanks to our sponsors Sponsors that joined recently are in bold. We are pleased to welcome Alter Way where their support of Debian is publicly acknowledged at the higher level, see this French quote of Alterway s CEO.

12 July 2022

Matthew Garrett: Responsible stewardship of the UEFI secure boot ecosystem

After I mentioned that Lenovo are now shipping laptops that only boot Windows by default, a few people pointed to a Lenovo document that says:

Starting in 2022 for Secured-core PCs it is a Microsoft requirement for the 3rd Party Certificate to be disabled by default.

"Secured-core" is a term used to describe machines that meet a certain set of Microsoft requirements around firmware security, and by and large it's a good thing - devices that meet these requirements are resilient against a whole bunch of potential attacks in the early boot process. But unfortunately the 2022 requirements don't seem to be publicly available, so it's difficult to know what's being asked for and why. But first, some background.

Most x86 UEFI systems that support Secure Boot trust at least two certificate authorities:

1) The Microsoft Windows Production PCA - this is used to sign the bootloader in production Windows builds. Trusting this is sufficient to boot Windows.
2) The Microsoft Corporation UEFI CA - this is used by Microsoft to sign non-Windows UEFI binaries, including built-in drivers for hardware that needs to work in the UEFI environment (such as GPUs and network cards) and bootloaders for non-Windows.

The apparent secured-core requirement for 2022 is that the second of these CAs should not be trusted by default. As a result, drivers or bootloaders signed with this certificate will not run on these systems. This means that, out of the box, these systems will not boot anything other than Windows[1].

Given the association with the secured-core requirements, this is presumably a security decision of some kind. Unfortunately, we have no real idea what this security decision is intended to protect against. The most likely scenario is concerns about the (in)security of binaries signed with the third-party signing key - there are some legitimate concerns here, but I'm going to cover why I don't think they're terribly realistic.

The first point is that, from a boot security perspective, a signed bootloader that will happily boot unsigned code kind of defeats the point. Kaspersky did it anyway. The second is that even a signed bootloader that is intended to only boot signed code may run into issues in the event of security vulnerabilities - the Boothole vulnerabilities are an example of this, covering multiple issues in GRUB that could allow for arbitrary code execution and potential loading of untrusted code.

So we know that signed bootloaders that will (either through accident or design) execute unsigned code exist. The signatures for all the known vulnerable bootloaders have been revoked, but that doesn't mean there won't be other vulnerabilities discovered in future. Configuring systems so that they don't trust the third-party CA means that those signed bootloaders won't be trusted, which means any future vulnerabilities will be irrelevant. This seems like a simple choice?

There's actually a couple of reasons why I don't think it's anywhere near that simple. The first is that whenever a signed object is booted by the firmware, the trusted certificate used to verify that object is measured into PCR 7 in the TPM. If a system previously booted with something signed with the Windows Production CA, and is now suddenly booting with something signed with the third-party UEFI CA, the values in PCR 7 will be different. TPMs support "sealing" a secret - encrypting it with a policy that the TPM will only decrypt it if certain conditions are met. Microsoft make use of this for their default Bitlocker disk encryption mechanism. The disk encryption key is encrypted by the TPM, and associated with a specific PCR 7 value. If the value of PCR 7 doesn't match, the TPM will refuse to decrypt the key, and the machine won't boot. This means that attempting to attack a Windows system that has Bitlocker enabled using a non-Windows bootloader will fail - the system will be unable to obtain the disk unlock key, which is a strong indication to the owner that they're being attacked.

The second is that this is predicated on the idea that removing the third-party bootloaders and drivers removes all the vulnerabilities. In fact, there's been rather a lot of vulnerabilities in the Windows bootloader. A broad enough vulnerability in the Windows bootloader is arguably a lot worse than a vulnerability in a third-party loader, since it won't change the PCR 7 measurements and the system will boot happily. Removing trust in the third-party CA does nothing to protect against this.

The third reason doesn't apply to all systems, but it does to many. System vendors frequently want to ship diagnostic or management utilities that run in the boot environment, but would prefer not to have to go to the trouble of getting them all signed by Microsoft. The simple solution to this is to ship their own certificate and sign all their tooling directly - the secured-core Lenovo I'm looking at currently is an example of this, with a Lenovo signing certificate. While everything signed with the third-party signing certificate goes through some degree of security review, there's no requirement for any vendor tooling to be reviewed at all. Removing the third-party CA does nothing to protect the user against the code that's most likely to contain vulnerabilities.

Obviously I may be missing something here - Microsoft may well have a strong technical justification. But they haven't shared it, and so right now we're left making guesses. And right now, I just don't see a good security argument.

But let's move on from the technical side of things and discuss the broader issue. The reason UEFI Secure Boot is present on most x86 systems is that Microsoft mandated it back in 2012. Microsoft chose to be the only trusted signing authority. Microsoft made the decision to assert that third-party code could be signed and trusted.

We've certainly learned some things since then, and a bunch of things have changed. Third-party bootloaders based on the Shim infrastructure are now reviewed via a community-managed process. We've had a productive coordinated response to the Boothole incident, which also taught us that the existing revocation strategy wasn't going to scale. In response, the community worked with Microsoft to develop a specification for making it easier to handle similar events in future. And it's also worth noting that after the initial Boothole disclosure was made to the GRUB maintainers, they proactively sought out other vulnerabilities in their codebase rather than simply patching what had been reported. The free software community has gone to great lengths to ensure third-party bootloaders are compatible with the security goals of UEFI Secure Boot.

So, to have Microsoft, the self-appointed steward of the UEFI Secure Boot ecosystem, turn round and say that a bunch of binaries that have been reviewed through processes developed in negotiation with Microsoft, implementing technologies designed to make management of revocation easier for Microsoft, and incorporating fixes for vulnerabilities discovered by the developers of those binaries who notified Microsoft of these issues despite having no obligation to do so, and which have then been signed by Microsoft are now considered by Microsoft to be insecure is, uh, kind of impolite? Especially when unreviewed vendor-signed binaries are still considered trustworthy, despite no external review being carried out at all.

If Microsoft had a set of criteria used to determine whether something is considered sufficiently trustworthy, we could determine which of these we fell short on and do something about that. From a technical perspective, Microsoft could set criteria that would allow a subset of third-party binaries that met additional review be trusted without having to trust all third-party binaries[2]. But, instead, this has been a decision made by the steward of this ecosystem without consulting major stakeholders.

If there are legitimate security concerns, let's talk about them and come up with solutions that fix them without doing a significant amount of collateral damage. Don't complain about a vendor blocking your apps and then do the same thing yourself.

[Edit to add: there seems to be some misunderstanding about where this restriction is being imposed. I bought this laptop because I'm interested in investigating the Microsoft Pluton security processor, but Pluton is not involved at all here. The restriction is being imposed by the firmware running on the main CPU, not any sort of functionality implemented on Pluton]

[1] They'll also refuse to run any drivers that are stored in flash on Thunderbolt devices, which means eGPU setups may be more complicated, as will netbooting off Thunderbolt-attached NICs
[2] Use a different leaf cert to sign the new trust tier, add the old leaf cert to dbx unless a config option is set, leave the existing intermediate in db

comment count unavailable comments

8 July 2022

Matthew Garrett: Lenovo shipping new laptops that only boot Windows by default

I finally managed to get hold of a Thinkpad Z13 to examine a functional implementation of Microsoft's Pluton security co-processor. Trying to boot Linux from a USB stick failed out of the box for no obvious reason, but after further examination the cause became clear - the firmware defaults to not trusting bootloaders or drivers signed with the Microsoft 3rd Party UEFI CA key. This means that given the default firmware configuration, nothing other than Windows will boot. It also means that you won't be able to boot from any third-party external peripherals that are plugged in via Thunderbolt.

There's no security benefit to this. If you want security here you're paying attention to the values measured into the TPM, and thanks to Microsoft's own specification for measurements made into PCR 7, switching from booting Windows to booting something signed with the 3rd party signing key will change the measurements and invalidate any sealed secrets. It's trivial to detect this. Distrusting the 3rd party CA by default doesn't improve security, it just makes it harder for users to boot alternative operating systems.

Lenovo, this isn't OK. The entire architecture of UEFI secure boot is that it allows for security without compromising user choice of OS. Restricting boot to Windows by default provides no security benefit but makes it harder for people to run the OS they want to. Please fix it.

comment count unavailable comments

29 June 2022

Aigars Mahinovs: Long travel in an electric car

Since the first week of April 2022 I have (finally!) changed my company car from a plug-in hybrid to a fully electic car. My new ride, for the next two years, is a BMW i4 M50 in Aventurine Red metallic. An ellegant car with very deep and memorable color, insanely powerful (544 hp/795 Nm), sub-4 second 0-100 km/h, large 84 kWh battery (80 kWh usable), charging up to 210 kW, top speed of 225 km/h and also very efficient (which came out best in this trip) with WLTP range of 510 km and EVDB real range of 435 km. The car also has performance tyres (Hankook Ventus S1 evo3 245/45R18 100Y XL in front and 255/45R18 103Y XL in rear all at recommended 2.5 bar) that have reduced efficiency. So I wanted to document and describe how was it for me to travel ~2000 km (one way) with this, electric, car from south of Germany to north of Latvia. I have done this trip many times before since I live in Germany now and travel back to my relatives in Latvia 1-2 times per year. This was the first time I made this trip in an electric car. And as this trip includes both travelling in Germany (where BEV infrastructure is best in the world) and across Eastern/Northen Europe, I believe that this can be interesting to a few people out there. Normally when I travelled this trip with a gasoline/diesel car I would normally drive for two days with an intermediate stop somewhere around Warsaw with about 12 hours of travel time in each day. This would normally include a couple bathroom stops in each day, at least one longer lunch stop and 3-4 refueling stops on top of that. Normally this would use at least 6 liters of fuel per 100 km on average with total usage of about 270 liters for the whole trip (or about 540 just in fuel costs, nowadays). My (personal) quirk is that both fuel and recharging of my (business) car inside Germany is actually paid by my employer, so it is useful for me to charge up (or fill up) at the last station in Gemany before driving on. The plan for this trip was made in a similar way as when travelling with a gasoline car: travelling as fast as possible on German Autobahn network to last chargin stop on the A4 near G rlitz, there charging up as much as reasonable and then travelling to a hotel in Warsaw, charging there overnight and travelling north towards Ionity chargers in Lithuania from where reaching the final target in north of Latvia should be possible. How did this plan meet the reality? Travelling inside Germany with an electric car was basically perfect. The most efficient way would involve driving fast and hard with top speed of even 180 km/h (where possible due to speed limits and traffic). BMW i4 is very efficient at high speeds with consumption maxing out at 28 kWh/100km when you actually drive at this speed all the time. In real situation in this trip we saw consumption of 20.8-22.2 kWh/100km in the first legs of the trip. The more traffic there is, the more speed limits and roadworks, the lower is the average speed and also the lower the consumption. With this kind of consumption we could comfortably drive 2 hours as fast as we could and then pick any fast charger along the route and in 26 minutes at a charger (50 kWh charged total) we'd be ready to drive for another 2 hours. This lines up very well with recommended rest stops for biological reasons (bathroom, water or coffee, a bit of movement to get blood circulating) and very close to what I had to do anyway with a gasoline car. With a gasoline car I had to refuel first, then park, then go to bathroom and so on. With an electric car I can do all of that while the car is charging and in the end the total time for a stop is very similar. Also not that there was a crazy heat wave going on and temperature outside was at about 34C minimum the whole day and hitting 40C at one point of the trip, so a lot of power was used for cooling. The car has a heat pump standard, but it still was working hard to keep us cool in the sun. The car was able to plan a charging route with all the charging stops required and had all the good options (like multiple intermediate stops) that many other cars (hi Tesla) and mobile apps (hi Google and Apple) do not have yet. There are a couple bugs with charging route and display of current route guidance, those are already fixed and will be delivered with over the air update with July 2022 update. Another good alterantive is the ABRP (A Better Route Planner) that was specifically designed for electric car routing along the best route for charging. Most phone apps (like Google Maps) have no idea about your specific electric car - it has no idea about the battery capacity, charging curve and is missing key live data as well - what is the current consumption and remaining energy in the battery. ABRP is different - it has data and profiles for almost all electric cars and can also be linked to live vehicle data, either via a OBD dongle or via a new Tronity cloud service. Tronity reads data from vehicle-specific cloud service, such as MyBMW service, saves it, tracks history and also re-transmits it to ABRP for live navigation planning. ABRP allows for options and settings that no car or app offers, for example, saying that you want to stop at a particular place for an hour or until battery is charged to 90%, or saying that you have specific charging cards and would only want to stop at chargers that support those. Both the car and the ABRP also support alternate routes even with multiple intermediate stops. In comparison, route planning by Google Maps or Apple Maps or Waze or even Tesla does not really come close. After charging up in the last German fast charger, a more interesting part of the trip started. In Poland the density of high performance chargers (HPC) is much lower than in Germany. There are many chargers (west of Warsaw), but vast majority of them are (relatively) slow 50kW chargers. And that is a difference between putting 50kWh into the car in 23-26 minutes or in 60 minutes. It does not seem too much, but the key bit here is that for 20 minutes there is easy to find stuff that should be done anyway, but after that you are done and you are just waiting for the car and if that takes 4 more minutes or 40 more minutes is a big, perceptual, difference. So using HPC is much, much preferable. So we put in the Ionity charger near Lodz as our intermediate target and the car suggested an intermediate stop at a Greenway charger by Katy Wroclawskie. The location is a bit weird - it has 4 charging stations with 150 kW each. The weird bits are that each station has two CCS connectors, but only one parking place (and the connectors share power, so if two cars were to connect, each would get half power). Also from the front of the location one can only see two stations, the otehr two are semi-hidden around a corner. We actually missed them on the way to Latvia and one person actually waited for the charger behind us for about 10 minutes. We only discovered the other two stations on the way back. With slower speeds in Poland the consumption goes down to 18 kWh/100km which translates to now up to 3 hours driving between stops. At the end of the first day we drove istarting from Ulm from 9:30 in the morning until about 23:00 in the evening with total distance of about 1100 km, 5 charging stops, starting with 92% battery, charging for 26 min (50 kWh), 33 min (57 kWh + lunch), 17 min (23 kWh), 12 min (17 kWh) and 13 min (37 kW). In the last two chargers you can see the difference between a good and fast 150 kW charger at high battery charge level and a really fast Ionity charger at low battery charge level, which makes charging faster still. Arriving to hotel with 23% of battery. Overnight the car charged from a Porsche Destination Charger to 87% (57 kWh). That was a bit less than I would expect from a full power 11kW charger, but good enough. Hotels should really install 11kW Type2 chargers for their guests, it is a really significant bonus that drives more clients to you. The road between Warsaw and Kaunas is the most difficult part of the trip for both driving itself and also for charging. For driving the problem is that there will be a new highway going from Warsaw to Lithuanian border, but it is actually not fully ready yet. So parts of the way one drives on the new, great and wide highway and parts of the way one drives on temporary roads or on old single lane undivided roads. And the most annoying part is navigating between parts as signs are not always clear and the maps are either too old or too new. Some maps do not have the new roads and others have on the roads that have not been actually build or opened to traffic yet. It's really easy to loose ones way and take a significant detour. As far as charging goes, basically there is only the slow 50 kW chargers between Warsaw and Kaunas (for now). We chose to charge on the last charger in Poland, by Suwalki Kaufland. That was not a good idea - there is only one 50 kW CCS and many people decide the same, so there can be a wait. We had to wait 17 minutes before we could charge for 30 more minutes just to get 18 kWh into the battery. Not the best use of time. On the way back we chose a different charger in Lomza where would have a relaxed dinner while the car was charging. That was far more relaxing and a better use of time. We also tried charging at an Orlen charger that was not recommended by our car and we found out why. Unlike all other chargers during our entire trip, this charger did not accept our universal BMW Charging RFID card. Instead it demanded that we download their own Orlen app and register there. The app is only available in some countries (and not in others) and on iPhone it is only available in Polish. That is a bad exception to the rule and a bad example. This is also how most charging works in USA. Here in Europe that is not normal. The normal is to use a charging card - either provided from the car maker or from another supplier (like PlugSufring or Maingau Energy). The providers then make roaming arrangements with all the charging networks, so the cards just work everywhere. In the end the user gets the prices and the bills from their card provider as a single monthly bill. This also saves all any credit card charges for the user. Having a clear, separate RFID card also means that one can easily choose how to pay for each charging session. For example, I have a corporate RFID card that my company pays for (for charging in Germany) and a private BMW Charging card that I am paying myself for (for charging abroad). Having the car itself authenticate direct with the charger (like Tesla does) removes the option to choose how to pay. Having each charge network have to use their own app or token bring too much chaos and takes too much setup. The optimum is having one card that works everywhere and having the option to have additional card or cards for specific purposes. Reaching Ionity chargers in Lithuania is again a breath of fresh air - 20-24 minutes to charge 50 kWh is as expected. One can charge on the first Ionity just enough to reach the next one and then on the second charger one can charge up enough to either reach the Ionity charger in Adazi or the final target in Latvia. There is a huge number of CSDD (Road Traffic and Safety Directorate) managed chargers all over Latvia, but they are 50 kW chargers. Good enough for local travel, but not great for long distance trips. BMW i4 charges at over 50 kW on a HPC even at over 90% battery state of charge (SoC). This means that it is always faster to charge up in a HPC than in a 50 kW charger, if that is at all possible. We also tested the CSDD chargers - they worked without any issues. One could pay with the BMW Charging RFID card, one could use the CSDD e-mobi app or token and one could also use Mobilly - an app that you can use in Latvia for everything from parking to public transport tickets or museums or car washes. We managed to reach our final destination near Aluksne with 17% range remaining after just 3 charging stops: 17+30 min (18 kWh), 24 min (48 kWh), 28 min (36 kWh). Last stop we charged to 90% which took a few extra minutes that would have been optimal. For travel around in Latvia we were charging at our target farmhouse from a normal 3 kW Schuko EU socket. That is very slow. We charged for 33 hours and went from 17% to 94%, so not really full. That was perfectly fine for our purposes. We easily reached Riga, drove to the sea and then back to Aluksne with 8% still in reserve and started charging again for the next trip. If it were required to drive around more and charge faster, we could have used the normal 3-phase 440V connection in the farmhouse to have a red CEE 16A plug installed (same as people use for welders). BMW i4 comes standard with a new BMW Flexible Fast Charger that has changable socket adapters. It comes by default with a Schucko connector in Europe, but for 90 one can buy an adapter for blue CEE plug (3.7 kW) or red CEE 16A or 32A plugs (11 kW). Some public charging stations in France actually use the blue CEE plugs instead of more common Type2 electric car charging stations. The CEE plugs are also common in camping parking places. On the way back the long distance BEV travel was already well understood and did not cause us any problem. From our destination we could easily reach the first Ionity in Lithuania, on the Panevezhis bypass road where in just 8 minutes we got 19 kWh and were ready to drive on to Kaunas, there a longer 32 minute stop before the charging desert of Suwalki Gap that gave us 52 kWh to 90%. That brought us to a shopping mall in Lomzha where we had some food and charged up 39 kWh in lazy 50 minutes. That was enough to bring us to our return hotel for the night - Hotel 500W in Strykow by Lodz that has a 50kW charger on site, while we were having late dinner and preparing for sleep, the car easily recharged to full (71 kWh in 95 minutes), so I just moved it from charger to a parking spot just before going to sleep. Really easy and well flowing day. Second day back went even better as we just needed an 18 minute stop at the same Katy Wroclawskie charger as before to get 22 kWh and that was enough to get back to Germany. After that we were again flying on the Autobahn and charging as needed, 15 min (31 kWh), 23 min (48 kWh) and 31 min (54 kWh + food). We started the day on about 9:40 and were home at 21:40 after driving just over 1000 km on that day. So less than 12 hours for 1000 km travelled, including all charging, bio stops, food and some traffic jams as well. Not bad. Now let's take a look at all the apps and data connections that a technically minded customer can have for their car. Architecturally the car is a network of computers by itself, but it is very secured and normally people do not have any direct access. However, once you log in into the car with your BMW account the car gets your profile info and preferences (seat settings, navigation favorites, ...) and the car then also can start sending information to the BMW backend about its status. This information is then available to the user over multiple different channels. There is no separate channel for each of those data flow. The data only goes once to the backend and then all other communication of apps happens with the backend. First of all the MyBMW app. This is the go-to for everything about the car - seeing its current status and location (when not driving), sending commands to the car (lock, unlock, flash lights, pre-condition, ...) and also monitor and control charging processes. You can also plan a route or destination in the app in advance and then just send it over to the car so it already knows where to drive to when you get to the car. This can also integrate with calendar entries, if you have locations for appointments, for example. This also shows full charging history and allows a very easy export of that data, here I exported all charging sessions from June and then trimmed it back to only sessions relevant to the trip and cut off some design elements to have the data more visible. So one can very easily see when and where we were charging, how much power we got at each spot and (if you set prices for locations) can even show costs. I've already mentioned the Tronity service and its ABRP integration, but it also saves the information that it gets from the car and gathers that data over time. It has nice aspects, like showing the driven routes on a map, having ways to do business trip accounting and having good calendar view. Sadly it does not correctly capture the data for charging sessions (the amounts are incorrect). Update: after talking to Tronity support, it looks like the bug was in the incorrect value for the usable battery capacity for my car. They will look into getting th eright values there by default, but as a workaround one can edit their car in their system (after at least one charging session) and directly set the expected battery capacity (usable) in the car properties on the Tronity web portal settings. One other fun way to see data from your BMW is using the BMW integration in Home Assistant. This brings the car as a device in your own smart home. You can read all the variables from the car current status (and Home Asisstant makes cute historical charts) and you can even see interesting trends, for example for remaining range shows much higher value in Latvia as its prediction is adapted to Latvian road speeds and during the trip it adapts to Polish and then to German road speeds and thus to higher consumption and thus lower maximum predicted remaining range. Having the car attached to the Home Assistant also allows you to attach the car to automations, both as data and event source (like detecting when car enters the "Home" zone) and also as target, so you could flash car lights or even unlock or lock it when certain conditions are met. So, what in the end was the most important thing - cost of the trip? In total we charged up 863 kWh, so that would normally cost one about 290 , which is close to half what this trip would have costed with a gasoline car. Out of that 279 kWh in Germany (paid by my employer) and 154 kWh in the farmhouse (paid by our wonderful relatives :D) so in the end the charging that I actually need to pay adds up to 430 kWh or about 150 . Typically, it took about 400 in fuel that I had to pay to get to Latvia and back. The difference is really nice! In the end I believe that there are three different ways of charging:
  • incidental charging - this is wast majority of charging in the normal day-to-day life. The car gets charged when and where it is convinient to do so along the way. If we go to a movie or a shop and there is a chance to leave the car at a charger, then it can charge up. Works really well, does not take extra time for charging from us.
  • fast charging - charging up at a HPC during optimal charging conditions - from relatively low level to no more than 70-80% while you are still doing all the normal things one would do in a quick stop in a long travel process: bio things, cleaning the windscreen, getting a coffee or a snack.
  • necessary charging - charging from a whatever charger is available just enough to be able to reach the next destination or the next fast charger.
The last category is the only one that is really annoying and should be avoided at all costs. Even by shifting your plans so that you find something else useful to do while necessary charging is happening and thus, at least partially, shifting it over to incidental charging category. Then you are no longer just waiting for the car, you are doing something else and the car magically is charged up again. And when one does that, then travelling with an electric car becomes no more annoying than travelling with a gasoline car. Having more breaks in a trip is a good thing and makes the trips actually easier and less stressfull - I was more relaxed during and after this trip than during previous trips. Having the car air conditioning always be on, even when stopped, was a godsend in the insane heat wave of 30C-38C that we were driving trough. Final stats: 4425 km driven in the trip. Average consumption: 18.7 kWh/100km. Time driving: 2 days and 3 hours. Car regened 152 kWh. Charging stations recharged 863 kWh. Questions? You can use this i4talk forum thread or this Twitter thread to ask them to me.

17 June 2022

Antoine Beaupr : Matrix notes

I have some concerns about Matrix (the protocol, not the movie that came out recently, although I do have concerns about that as well). I've been watching the project for a long time, and it seems more a promising alternative to many protocols like IRC, XMPP, and Signal. This review may sound a bit negative, because it focuses on those concerns. I am the operator of an IRC network and people keep asking me to bridge it with Matrix. I have myself considered just giving up on IRC and converting to Matrix. This space is a living document exploring my research of that problem space. The TL;DR: is that no, I'm not setting up a bridge just yet, and I'm still on IRC. This article was written over the course of the last three months, but I have been watching the Matrix project for years (my logs seem to say 2016 at least). The article is rather long. It will likely take you half an hour to read, so copy this over to your ebook reader, your tablet, or dead trees, and lean back and relax as I show you around the Matrix. Or, alternatively, just jump to a section that interest you, most likely the conclusion.

Introduction to Matrix Matrix is an "open standard for interoperable, decentralised, real-time communication over IP. It can be used to power Instant Messaging, VoIP/WebRTC signalling, Internet of Things communication - or anywhere you need a standard HTTP API for publishing and subscribing to data whilst tracking the conversation history". It's also (when compared with XMPP) "an eventually consistent global JSON database with an HTTP API and pubsub semantics - whilst XMPP can be thought of as a message passing protocol." According to their FAQ, the project started in 2014, has about 20,000 servers, and millions of users. Matrix works over HTTPS but over a special port: 8448.

Security and privacy I have some concerns about the security promises of Matrix. It's advertised as a "secure" with "E2E [end-to-end] encryption", but how does it actually work?

Data retention defaults One of my main concerns with Matrix is data retention, which is a key part of security in a threat model where (for example) an hostile state actor wants to surveil your communications and can seize your devices. On IRC, servers don't actually keep messages all that long: they pass them along to other servers and clients as fast as they can, only keep them in memory, and move on to the next message. There are no concerns about data retention on messages (and their metadata) other than the network layer. (I'm ignoring the issues with user registration, which is a separate, if valid, concern.) Obviously, an hostile server could log everything passing through it, but IRC federations are normally tightly controlled. So, if you trust your IRC operators, you should be fairly safe. Obviously, clients can (and often do, even if OTR is configured!) log all messages, but this is generally not the default. Irssi, for example, does not log by default. IRC bouncers are more likely to log to disk, of course, to be able to do what they do. Compare this to Matrix: when you send a message to a Matrix homeserver, that server first stores it in its internal SQL database. Then it will transmit that message to all clients connected to that server and room, and to all other servers that have clients connected to that room. Those remote servers, in turn, will keep a copy of that message and all its metadata in their own database, by default forever. On encrypted rooms those messages are encrypted, but not their metadata. There is a mechanism to expire entries in Synapse, but it is not enabled by default. So one should generally assume that a message sent on Matrix is never expired.

GDPR in the federation But even if that setting was enabled by default, how do you control it? This is a fundamental problem of the federation: if any user is allowed to join a room (which is the default), those user's servers will log all content and metadata from that room. That includes private, one-on-one conversations, since those are essentially rooms as well. In the context of the GDPR, this is really tricky: who is the responsible party (known as the "data controller") here? It's basically any yahoo who fires up a home server and joins a room. In a federated network, one has to wonder whether GDPR enforcement is even possible at all. But in Matrix in particular, if you want to enforce your right to be forgotten in a given room, you would have to:
  1. enumerate all the users that ever joined the room while you were there
  2. discover all their home servers
  3. start a GDPR procedure against all those servers
I recognize this is a hard problem to solve while still keeping an open ecosystem. But I believe that Matrix should have much stricter defaults towards data retention than right now. Message expiry should be enforced by default, for example. (Note that there are also redaction policies that could be used to implement part of the GDPR automatically, see the privacy policy discussion below on that.) Also keep in mind that, in the brave new peer-to-peer world that Matrix is heading towards, the boundary between server and client is likely to be fuzzier, which would make applying the GDPR even more difficult. Update: this comment links to this post (in german) which apparently studied the question and concluded that Matrix is not GDPR-compliant. In fact, maybe Synapse should be designed so that there's no configurable flag to turn off data retention. A bit like how most system loggers in UNIX (e.g. syslog) come with a log retention system that typically rotate logs after a few weeks or month. Historically, this was designed to keep hard drives from filling up, but it also has the added benefit of limiting the amount of personal information kept on disk in this modern day. (Arguably, syslog doesn't rotate logs on its own, but, say, Debian GNU/Linux, as an installed system, does have log retention policies well defined for installed packages, and those can be discussed. And "no expiry" is definitely a bug.

Matrix.org privacy policy When I first looked at Matrix, five years ago, Element.io was called Riot.im and had a rather dubious privacy policy:
We currently use cookies to support our use of Google Analytics on the Website and Service. Google Analytics collects information about how you use the Website and Service. [...] This helps us to provide you with a good experience when you browse our Website and use our Service and also allows us to improve our Website and our Service.
When I asked Matrix people about why they were using Google Analytics, they explained this was for development purposes and they were aiming for velocity at the time, not privacy (paraphrasing here). They also included a "free to snitch" clause:
If we are or believe that we are under a duty to disclose or share your personal data, we will do so in order to comply with any legal obligation, the instructions or requests of a governmental authority or regulator, including those outside of the UK.
Those are really broad terms, above and beyond what is typically expected legally. Like the current retention policies, such user tracking and ... "liberal" collaboration practices with the state set a bad precedent for other home servers. Thankfully, since the above policy was published (2017), the GDPR was "implemented" (2018) and it seems like both the Element.io privacy policy and the Matrix.org privacy policy have been somewhat improved since. Notable points of the new privacy policies:
  • 2.3.1.1: the "federation" section actually outlines that "Federated homeservers and Matrix clients which respect the Matrix protocol are expected to honour these controls and redaction/erasure requests, but other federated homeservers are outside of the span of control of Element, and we cannot guarantee how this data will be processed"
  • 2.6: users under the age of 16 should not use the matrix.org service
  • 2.10: Upcloud, Mythic Beast, Amazon, and CloudFlare possibly have access to your data (it's nice to at least mention this in the privacy policy: many providers don't even bother admitting to this kind of delegation)
  • Element 2.2.1: mentions many more third parties (Twilio, Stripe, Quaderno, LinkedIn, Twitter, Google, Outplay, PipeDrive, HubSpot, Posthog, Sentry, and Matomo (phew!) used when you are paying Matrix.org for hosting
I'm not super happy with all the trackers they have on the Element platform, but then again you don't have to use that service. Your favorite homeserver (assuming you are not on Matrix.org) probably has their own Element deployment, hopefully without all that garbage. Overall, this is all a huge improvement over the previous privacy policy, so hats off to the Matrix people for figuring out a reasonable policy in such a tricky context. I particularly like this bit:
We will forget your copy of your data upon your request. We will also forward your request to be forgotten onto federated homeservers. However - these homeservers are outside our span of control, so we cannot guarantee they will forget your data.
It's great they implemented those mechanisms and, after all, if there's an hostile party in there, nothing can prevent them from using screenshots to just exfiltrate your data away from the client side anyways, even with services typically seen as more secure, like Signal. As an aside, I also appreciate that Matrix.org has a fairly decent code of conduct, based on the TODO CoC which checks all the boxes in the geekfeminism wiki.

Metadata handling Overall, privacy protections in Matrix mostly concern message contents, not metadata. In other words, who's talking with who, when and from where is not well protected. Compared to a tool like Signal, which goes through great lengths to anonymize that data with features like private contact discovery, disappearing messages, sealed senders, and private groups, Matrix is definitely behind. (Note: there is an issue open about message lifetimes in Element since 2020, but it's not at even at the MSC stage yet.) This is a known issue (opened in 2019) in Synapse, but this is not just an implementation issue, it's a flaw in the protocol itself. Home servers keep join/leave of all rooms, which gives clear text information about who is talking to. Synapse logs may also contain privately identifiable information that home server admins might not be aware of in the first place. Those log rotation policies are separate from the server-level retention policy, which may be confusing for a novice sysadmin. Combine this with the federation: even if you trust your home server to do the right thing, the second you join a public room with third-party home servers, those ideas kind of get thrown out because those servers can do whatever they want with that information. Again, a problem that is hard to solve in any federation. To be fair, IRC doesn't have a great story here either: any client knows not only who's talking to who in a room, but also typically their client IP address. Servers can (and often do) obfuscate this, but often that obfuscation is trivial to reverse. Some servers do provide "cloaks" (sometimes automatically), but that's kind of a "slap-on" solution that actually moves the problem elsewhere: now the server knows a little more about the user. Overall, I would worry much more about a Matrix home server seizure than a IRC or Signal server seizure. Signal does get subpoenas, and they can only give out a tiny bit of information about their users: their phone number, and their registration, and last connection date. Matrix carries a lot more information in its database.

Amplification attacks on URL previews I (still!) run an Icecast server and sometimes share links to it on IRC which, obviously, also ends up on (more than one!) Matrix home servers because some people connect to IRC using Matrix. This, in turn, means that Matrix will connect to that URL to generate a link preview. I feel this outlines a security issue, especially because those sockets would be kept open seemingly forever. I tried to warn the Matrix security team but somehow, I don't think this issue was taken very seriously. Here's the disclosure timeline:
  • January 18: contacted Matrix security
  • January 19: response: already reported as a bug
  • January 20: response: can't reproduce
  • January 31: timeout added, considered solved
  • January 31: I respond that I believe the security issue is underestimated, ask for clearance to disclose
  • February 1: response: asking for two weeks delay after the next release (1.53.0) including another patch, presumably in two weeks' time
  • February 22: Matrix 1.53.0 released
  • April 14: I notice the release, ask for clearance again
  • April 14: response: referred to the public disclosure
There are a couple of problems here:
  1. the bug was publicly disclosed in September 2020, and not considered a security issue until I notified them, and even then, I had to insist
  2. no clear disclosure policy timeline was proposed or seems established in the project (there is a security disclosure policy but it doesn't include any predefined timeline)
  3. I wasn't informed of the disclosure
  4. the actual solution is a size limit (10MB, already implemented), a time limit (30 seconds, implemented in PR 11784), and a content type allow list (HTML, "media" or JSON, implemented in PR 11936), and I'm not sure it's adequate
  5. (pure vanity:) I did not make it to their Hall of fame
I'm not sure those solutions are adequate because they all seem to assume a single home server will pull that one URL for a little while then stop. But in a federated network, many (possibly thousands) home servers may be connected in a single room at once. If an attacker drops a link into such a room, all those servers would connect to that link all at once. This is an amplification attack: a small amount of traffic will generate a lot more traffic to a single target. It doesn't matter there are size or time limits: the amplification is what matters here. It should also be noted that clients that generate link previews have more amplification because they are more numerous than servers. And of course, the default Matrix client (Element) does generate link previews as well. That said, this is possibly not a problem specific to Matrix: any federated service that generates link previews may suffer from this. I'm honestly not sure what the solution is here. Maybe moderation? Maybe link previews are just evil? All I know is there was this weird bug in my Icecast server and I tried to ring the bell about it, and it feels it was swept under the rug. Somehow I feel this is bound to blow up again in the future, even with the current mitigation.

Moderation In Matrix like elsewhere, Moderation is a hard problem. There is a detailed moderation guide and much of this problem space is actively worked on in Matrix right now. A fundamental problem with moderating a federated space is that a user banned from a room can rejoin the room from another server. This is why spam is such a problem in Email, and why IRC networks have stopped federating ages ago (see the IRC history for that fascinating story).

The mjolnir bot The mjolnir moderation bot is designed to help with some of those things. It can kick and ban users, redact all of a user's message (as opposed to one by one), all of this across multiple rooms. It can also subscribe to a federated block list published by matrix.org to block known abusers (users or servers). Bans are pretty flexible and can operate at the user, room, or server level. Matrix people suggest making the bot admin of your channels, because you can't take back admin from a user once given.

The command-line tool There's also a new command line tool designed to do things like:
  • System notify users (all users/users from a list, specific user)
  • delete sessions/devices not seen for X days
  • purge the remote media cache
  • select rooms with various criteria (external/local/empty/created by/encrypted/cleartext)
  • purge history of theses rooms
  • shutdown rooms
This tool and Mjolnir are based on the admin API built into Synapse.

Rate limiting Synapse has pretty good built-in rate-limiting which blocks repeated login, registration, joining, or messaging attempts. It may also end up throttling servers on the federation based on those settings.

Fundamental federation problems Because users joining a room may come from another server, room moderators are at the mercy of the registration and moderation policies of those servers. Matrix is like IRC's +R mode ("only registered users can join") by default, except that anyone can register their own homeserver, which makes this limited. Server admins can block IP addresses and home servers, but those tools are not easily available to room admins. There is an API (m.room.server_acl in /devtools) but it is not reliable (thanks Austin Huang for the clarification). Matrix has the concept of guest accounts, but it is not used very much, and virtually no client or homeserver supports it. This contrasts with the way IRC works: by default, anyone can join an IRC network even without authentication. Some channels require registration, but in general you are free to join and look around (until you get blocked, of course). I have seen anecdotal evidence (CW: Twitter, nitter link) that "moderating bridges is hell", and I can imagine why. Moderation is already hard enough on one federation, when you bridge a room with another network, you inherit all the problems from that network but without the entire abuse control tools from the original network's API...

Room admins Matrix, in particular, has the problem that room administrators (which have the power to redact messages, ban users, and promote other users) are bound to their Matrix ID which is, in turn, bound to their home servers. This implies that a home server administrators could (1) impersonate a given user and (2) use that to hijack the room. So in practice, the home server is the trust anchor for rooms, not the user themselves. That said, if server B administrator hijack user joe on server B, they will hijack that room on that specific server. This will not (necessarily) affect users on the other servers, as servers could refuse parts of the updates or ban the compromised account (or server). It does seem like a major flaw that room credentials are bound to Matrix identifiers, as opposed to the E2E encryption credentials. In an encrypted room even with fully verified members, a compromised or hostile home server can still take over the room by impersonating an admin. That admin (or even a newly minted user) can then send events or listen on the conversations. This is even more frustrating when you consider that Matrix events are actually signed and therefore have some authentication attached to them, acting like some sort of Merkle tree (as it contains a link to previous events). That signature, however, is made from the homeserver PKI keys, not the client's E2E keys, which makes E2E feel like it has been "bolted on" later.

Availability While Matrix has a strong advantage over Signal in that it's decentralized (so anyone can run their own homeserver,), I couldn't find an easy way to run a "multi-primary" setup, or even a "redundant" setup (even if with a single primary backend), short of going full-on "replicate PostgreSQL and Redis data", which is not typically for the faint of heart.

How this works in IRC On IRC, it's quite easy to setup redundant nodes. All you need is:
  1. a new machine (with it's own public address with an open port)
  2. a shared secret (or certificate) between that machine and an existing one on the network
  3. a connect block on both servers
That's it: the node will join the network and people can connect to it as usual and share the same user/namespace as the rest of the network. The servers take care of synchronizing state: you do not need to worry about replicating a database server. (Now, experienced IRC people will know there's a catch here: IRC doesn't have authentication built in, and relies on "services" which are basically bots that authenticate users (I'm simplifying, don't nitpick). If that service goes down, the network still works, but then people can't authenticate, and they can start doing nasty things like steal people's identity if they get knocked offline. But still: basic functionality still works: you can talk in rooms and with users that are on the reachable network.)

User identities Matrix is more complicated. Each "home server" has its own identity namespace: a specific user (say @anarcat:matrix.org) is bound to that specific home server. If that server goes down, that user is completely disconnected. They could register a new account elsewhere and reconnect, but then they basically lose all their configuration: contacts, joined channels are all lost. (Also notice how the Matrix IDs don't look like a typical user address like an email in XMPP. They at least did their homework and got the allocation for the scheme.)

Rooms Users talk to each other in "rooms", even in one-to-one communications. (Rooms are also used for other things like "spaces", they're basically used for everything, think "everything is a file" kind of tool.) For rooms, home servers act more like IRC nodes in that they keep a local state of the chat room and synchronize it with other servers. Users can keep talking inside a room if the server that originally hosts the room goes down. Rooms can have a local, server-specific "alias" so that, say, #room:matrix.org is also visible as #room:example.com on the example.com home server. Both addresses refer to the same room underlying room. (Finding this in the Element settings is not obvious though, because that "alias" are actually called a "local address" there. So to create such an alias (in Element), you need to go in the room settings' "General" section, "Show more" in "Local address", then add the alias name (e.g. foo), and then that room will be available on your example.com homeserver as #foo:example.com.) So a room doesn't belong to a server, it belongs to the federation, and anyone can join the room from any serer (if the room is public, or if invited otherwise). You can create a room on server A and when a user from server B joins, the room will be replicated on server B as well. If server A fails, server B will keep relaying traffic to connected users and servers. A room is therefore not fundamentally addressed with the above alias, instead ,it has a internal Matrix ID, which basically a random string. It has a server name attached to it, but that was made just to avoid collisions. That can get a little confusing. For example, the #fractal:gnome.org room is an alias on the gnome.org server, but the room ID is !hwiGbsdSTZIwSRfybq:matrix.org. That's because the room was created on matrix.org, but the preferred branding is gnome.org now. As an aside, rooms, by default, live forever, even after the last user quits. There's an admin API to delete rooms and a tombstone event to redirect to another one, but neither have a GUI yet. The latter is part of MSC1501 ("Room version upgrades") which allows a room admin to close a room, with a message and a pointer to another room.

Spaces Discovering rooms can be tricky: there is a per-server room directory, but Matrix.org people are trying to deprecate it in favor of "Spaces". Room directories were ripe for abuse: anyone can create a room, so anyone can show up in there. It's possible to restrict who can add aliases, but anyways directories were seen as too limited. In contrast, a "Space" is basically a room that's an index of other rooms (including other spaces), so existing moderation and administration mechanism that work in rooms can (somewhat) work in spaces as well. This enables a room directory that works across federation, regardless on which server they were originally created. New users can be added to a space or room automatically in Synapse. (Existing users can be told about the space with a server notice.) This gives admins a way to pre-populate a list of rooms on a server, which is useful to build clusters of related home servers, providing some sort of redundancy, at the room -- not user -- level.

Home servers So while you can workaround a home server going down at the room level, there's no such thing at the home server level, for user identities. So if you want those identities to be stable in the long term, you need to think about high availability. One limitation is that the domain name (e.g. matrix.example.com) must never change in the future, as renaming home servers is not supported. The documentation used to say you could "run a hot spare" but that has been removed. Last I heard, it was not possible to run a high-availability setup where multiple, separate locations could replace each other automatically. You can have high performance setups where the load gets distributed among workers, but those are based on a shared database (Redis and PostgreSQL) backend. So my guess is it would be possible to create a "warm" spare server of a matrix home server with regular PostgreSQL replication, but that is not documented in the Synapse manual. This sort of setup would also not be useful to deal with networking issues or denial of service attacks, as you will not be able to spread the load over multiple network locations easily. Redis and PostgreSQL heroes are welcome to provide their multi-primary solution in the comments. In the meantime, I'll just point out this is a solution that's handled somewhat more gracefully in IRC, by having the possibility of delegating the authentication layer.

Delegations If you do not want to run a Matrix server yourself, it's possible to delegate the entire thing to another server. There's a server discovery API which uses the .well-known pattern (or SRV records, but that's "not recommended" and a bit confusing) to delegate that service to another server. Be warned that the server still needs to be explicitly configured for your domain. You can't just put:
  "m.server": "matrix.org:443"  
... on https://example.com/.well-known/matrix/server and start using @you:example.com as a Matrix ID. That's because Matrix doesn't support "virtual hosting" and you'd still be connecting to rooms and people with your matrix.org identity, not example.com as you would normally expect. This is also why you cannot rename your home server. The server discovery API is what allows servers to find each other. Clients, on the other hand, use the client-server discovery API: this is what allows a given client to find your home server when you type your Matrix ID on login.

Performance The high availability discussion brushed over the performance of Matrix itself, but let's now dig into that.

Horizontal scalability There were serious scalability issues of the main Matrix server, Synapse, in the past. So the Matrix team has been working hard to improve its design. Since Synapse 1.22 the home server can horizontally scale to multiple workers (see this blog post for details) which can make it easier to scale large servers.

Other implementations There are other promising home servers implementations from a performance standpoint (dendrite, Golang, entered beta in late 2020; conduit, Rust, beta; others), but none of those are feature-complete so there's a trade-off to be made there. Synapse is also adding a lot of feature fast, so it's an open question whether the others will ever catch up. (I have heard that Dendrite might actually surpass Synapse in features within a few years, which would put Synapse in a more "LTS" situation.)

Latency Matrix can feel slow sometimes. For example, joining the "Matrix HQ" room in Element (from matrix.debian.social) takes a few minutes and then fails. That is because the home server has to sync the entire room state when you join the room. There was promising work on this announced in the lengthy 2021 retrospective, and some of that work landed (partial sync) in the 1.53 release already. Other improvements coming include sliding sync, lazy loading over federation, and fast room joins. So that's actually something that could be fixed in the fairly short term. But in general, communication in Matrix doesn't feel as "snappy" as on IRC or even Signal. It's hard to quantify this without instrumenting a full latency test bed (for example the tools I used in the terminal emulators latency tests), but even just typing in a web browser feels slower than typing in a xterm or Emacs for me. Even in conversations, I "feel" people don't immediately respond as fast. In fact, this could be an interesting double-blind experiment to make: have people guess whether they are talking to a person on Matrix, XMPP, or IRC, for example. My theory would be that people could notice that Matrix users are slower, if only because of the TCP round-trip time each message has to take.

Transport Some courageous person actually made some tests of various messaging platforms on a congested network. His evaluation was basically:
  • Briar: uses Tor, so unusable except locally
  • Matrix: "struggled to send and receive messages", joining a room takes forever as it has to sync all history, "took 20-30 seconds for my messages to be sent and another 20 seconds for further responses"
  • XMPP: "worked in real-time, full encryption, with nearly zero lag"
So that was interesting. I suspect IRC would have also fared better, but that's just a feeling. Other improvements to the transport layer include support for websocket and the CoAP proxy work from 2019 (targeting 100bps links), but both seem stalled at the time of writing. The Matrix people have also announced the pinecone p2p overlay network which aims at solving large, internet-scale routing problems. See also this talk at FOSDEM 2022.

Usability

Onboarding and workflow The workflow for joining a room, when you use Element web, is not great:
  1. click on a link in a web browser
  2. land on (say) https://matrix.to/#/#matrix-dev:matrix.org
  3. offers "Element", yeah that's sounds great, let's click "Continue"
  4. land on https://app.element.io/#/room%2F%23matrix-dev%3Amatrix.org and then you need to register, aaargh
As you might have guessed by now, there is a specification to solve this, but web browsers need to adopt it as well, so that's far from actually being solved. At least browsers generally know about the matrix: scheme, it's just not exactly clear what they should do with it, especially when the handler is just another web page (e.g. Element web). In general, when compared with tools like Signal or WhatsApp, Matrix doesn't fare so well in terms of user discovery. I probably have some of my normal contacts that have a Matrix account as well, but there's really no way to know. It's kind of creepy when Signal tells you "this person is on Signal!" but it's also pretty cool that it works, and they actually implemented it pretty well. Registration is also less obvious: in Signal, the app confirms your phone number automatically. It's friction-less and quick. In Matrix, you need to learn about home servers, pick one, register (with a password! aargh!), and then setup encryption keys (not default), etc. It's a lot more friction. And look, I understand: giving away your phone number is a huge trade-off. I don't like it either. But it solves a real problem and makes encryption accessible to a ton more people. Matrix does have "identity servers" that can serve that purpose, but I don't feel confident sharing my phone number there. It doesn't help that the identity servers don't have private contact discovery: giving them your phone number is a more serious security compromise than with Signal. There's a catch-22 here too: because no one feels like giving away their phone numbers, no one does, and everyone assumes that stuff doesn't work anyways. Like it or not, Signal forcing people to divulge their phone number actually gives them critical mass that means actually a lot of my relatives are on Signal and I don't have to install crap like WhatsApp to talk with them.

5 minute clients evaluation Throughout all my tests I evaluated a handful of Matrix clients, mostly from Flathub because almost none of them are packaged in Debian. Right now I'm using Element, the flagship client from Matrix.org, in a web browser window, with the PopUp Window extension. This makes it look almost like a native app, and opens links in my main browser window (instead of a new tab in that separate window), which is nice. But I'm tired of buying memory to feed my web browser, so this indirection has to stop. Furthermore, I'm often getting completely logged off from Element, which means re-logging in, recovering my security keys, and reconfiguring my settings. That is extremely annoying. Coming from Irssi, Element is really "GUI-y" (pronounced "gooey"). Lots of clickety happening. To mark conversations as read, in particular, I need to click-click-click on all the tabs that have some activity. There's no "jump to latest message" or "mark all as read" functionality as far as I could tell. In Irssi the former is built-in (alt-a) and I made a custom /READ command for the latter:
/ALIAS READ script exec \$_->activity(0) for Irssi::windows
And yes, that's a Perl script in my IRC client. I am not aware of any Matrix client that does stuff like that, except maybe Weechat, if we can call it a Matrix client, or Irssi itself, now that it has a Matrix plugin (!). As for other clients, I have looked through the Matrix Client Matrix (confusing right?) to try to figure out which one to try, and, even after selecting Linux as a filter, the chart is just too wide to figure out anything. So I tried those, kind of randomly:
  • Fractal
  • Mirage
  • Nheko
  • Quaternion
Unfortunately, I lost my notes on those, I don't actually remember which one did what. I still have a session open with Mirage, so I guess that means it's the one I preferred, but I remember they were also all very GUI-y. Maybe I need to look at weechat-matrix or gomuks. At least Weechat is scriptable so I could continue playing the power-user. Right now my strategy with messaging (and that includes microblogging like Twitter or Mastodon) is that everything goes through my IRC client, so Weechat could actually fit well in there. Going with gomuks, on the other hand, would mean running it in parallel with Irssi or ... ditching IRC, which is a leap I'm not quite ready to take just yet. Oh, and basically none of those clients (except Nheko and Element) support VoIP, which is still kind of a second-class citizen in Matrix. It does not support large multimedia rooms, for example: Jitsi was used for FOSDEM instead of the native videoconferencing system.

Bots This falls a little aside the "usability" section, but I didn't know where to put this... There's a few Matrix bots out there, and you are likely going to be able to replace your existing bots with Matrix bots. It's true that IRC has a long and impressive history with lots of various bots doing various things, but given how young Matrix is, there's still a good variety:
  • maubot: generic bot with tons of usual plugins like sed, dice, karma, xkcd, echo, rss, reminder, translate, react, exec, gitlab/github webhook receivers, weather, etc
  • opsdroid: framework to implement "chat ops" in Matrix, connects with Matrix, GitHub, GitLab, Shell commands, Slack, etc
  • matrix-nio: another framework, used to build lots more bots like:
    • hemppa: generic bot with various functionality like weather, RSS feeds, calendars, cron jobs, OpenStreetmaps lookups, URL title snarfing, wolfram alpha, astronomy pic of the day, Mastodon bridge, room bridging, oh dear
    • devops: ping, curl, etc
    • podbot: play podcast episodes from AntennaPod
    • cody: Python, Ruby, Javascript REPL
    • eno: generic bot, "personal assistant"
  • mjolnir: moderation bot
  • hookshot: bridge with GitLab/GitHub
  • matrix-monitor-bot: latency monitor
One thing I haven't found an equivalent for is Debian's MeetBot. There's an archive bot but it doesn't have topics or a meeting chair, or HTML logs.

Working on Matrix As a developer, I find Matrix kind of intimidating. The specification is huge. The official specification itself looks somewhat digestable: it's only 6 APIs so that looks, at first, kind of reasonable. But whenever you start asking complicated questions about Matrix, you quickly fall into the Matrix Spec Change specification (which, yes, is a separate specification). And there are literally hundreds of MSCs flying around. It's hard to tell what's been adopted and what hasn't, and even harder to figure out if your specific client has implemented it. (One trendy answer to this problem is to "rewrite it in rust": Matrix are working on implementing a lot of those specifications in a matrix-rust-sdk that's designed to take the implementation details away from users.) Just taking the latest weekly Matrix report, you find that three new MSCs proposed, just last week! There's even a graph that shows the number of MSCs is progressing steadily, at 600+ proposals total, with the majority (300+) "new". I would guess the "merged" ones are at about 150. That's a lot of text which includes stuff like 3D worlds which, frankly, I don't think you should be working on when you have such important security and usability problems. (The internet as a whole, arguably, doesn't fare much better. RFC600 is a really obscure discussion about "INTERFACING AN ILLINOIS PLASMA TERMINAL TO THE ARPANET". Maybe that's how many MSCs will end up as well, left forgotten in the pits of history.) And that's the thing: maybe the Matrix people have a different objective than I have. They want to connect everything to everything, and make Matrix a generic transport for all sorts of applications, including virtual reality, collaborative editors, and so on. I just want secure, simple messaging. Possibly with good file transfers, and video calls. That it works with existing stuff is good, and it should be federated to remove the "Signal point of failure". So I'm a bit worried with the direction all those MSCs are taking, especially when you consider that clients other than Element are still struggling to keep up with basic features like end-to-end encryption or room discovery, never mind voice or spaces...

Conclusion Overall, Matrix is somehow in the space XMPP was a few years ago. It has a ton of features, pretty good clients, and a large community. It seems to have gained some of the momentum that XMPP has lost. It may have the most potential to replace Signal if something bad would happen to it (like, I don't know, getting banned or going nuts with cryptocurrency)... But it's really not there yet, and I don't see Matrix trying to get there either, which is a bit worrisome.

Looking back at history I'm also worried that we are repeating the errors of the past. The history of federated services is really fascinating:. IRC, FTP, HTTP, and SMTP were all created in the early days of the internet, and are all still around (except, arguably, FTP, which was removed from major browsers recently). All of them had to face serious challenges in growing their federation. IRC had numerous conflicts and forks, both at the technical level but also at the political level. The history of IRC is really something that anyone working on a federated system should study in detail, because they are bound to make the same mistakes if they are not familiar with it. The "short" version is:
  • 1988: Finnish researcher publishes first IRC source code
  • 1989: 40 servers worldwide, mostly universities
  • 1990: EFnet ("eris-free network") fork which blocks the "open relay", named Eris - followers of Eris form the A-net, which promptly dissolves itself, with only EFnet remaining
  • 1992: Undernet fork, which offered authentication ("services"), routing improvements and timestamp-based channel synchronisation
  • 1994: DALnet fork, from Undernet, again on a technical disagreement
  • 1995: Freenode founded
  • 1996: IRCnet forks from EFnet, following a flame war of historical proportion, splitting the network between Europe and the Americas
  • 1997: Quakenet founded
  • 1999: (XMPP founded)
  • 2001: 6 million users, OFTC founded
  • 2002: DALnet peaks at 136,000 users
  • 2003: IRC as a whole peaks at 10 million users, EFnet peaks at 141,000 users
  • 2004: (Facebook founded), Undernet peaks at 159,000 users
  • 2005: Quakenet peaks at 242,000 users, IRCnet peaks at 136,000 (Youtube founded)
  • 2006: (Twitter founded)
  • 2009: (WhatsApp, Pinterest founded)
  • 2010: (TextSecure AKA Signal, Instagram founded)
  • 2011: (Snapchat founded)
  • ~2013: Freenode peaks at ~100,000 users
  • 2016: IRCv3 standardisation effort started (TikTok founded)
  • 2021: Freenode self-destructs, Libera chat founded
  • 2022: Libera peaks at 50,000 users, OFTC peaks at 30,000 users
(The numbers were taken from the Wikipedia page and Netsplit.de. Note that I also include other networks launch in parenthesis for context.) Pretty dramatic, don't you think? Eventually, somehow, IRC became irrelevant for most people: few people are even aware of it now. With less than a million users active, it's smaller than Mastodon, XMPP, or Matrix at this point.1 If I were to venture a guess, I'd say that infighting, lack of a standardization body, and a somewhat annoying protocol meant the network could not grow. It's also possible that the decentralised yet centralised structure of IRC networks limited their reliability and growth. But large social media companies have also taken over the space: observe how IRC numbers peak around the time the wave of large social media companies emerge, especially Facebook (2.9B users!!) and Twitter (400M users).

Where the federated services are in history Right now, Matrix, and Mastodon (and email!) are at the "pre-EFnet" stage: anyone can join the federation. Mastodon has started working on a global block list of fascist servers which is interesting, but it's still an open federation. Right now, Matrix is totally open, but matrix.org publishes a (federated) block list of hostile servers (#matrix-org-coc-bl:matrix.org, yes, of course it's a room). Interestingly, Email is also in that stage, where there are block lists of spammers, and it's a race between those blockers and spammers. Large email providers, obviously, are getting closer to the EFnet stage: you could consider they only accept email from themselves or between themselves. It's getting increasingly hard to deliver mail to Outlook and Gmail for example, partly because of bias against small providers, but also because they are including more and more machine-learning tools to sort through email and those systems are, fundamentally, unknowable. It's not quite the same as splitting the federation the way EFnet did, but the effect is similar. HTTP has somehow managed to live in a parallel universe, as it's technically still completely federated: anyone can start a web server if they have a public IP address and anyone can connect to it. The catch, of course, is how you find the darn thing. Which is how Google became one of the most powerful corporations on earth, and how they became the gatekeepers of human knowledge online. I have only briefly mentioned XMPP here, and my XMPP fans will undoubtedly comment on that, but I think it's somewhere in the middle of all of this. It was co-opted by Facebook and Google, and both corporations have abandoned it to its fate. I remember fondly the days where I could do instant messaging with my contacts who had a Gmail account. Those days are gone, and I don't talk to anyone over Jabber anymore, unfortunately. And this is a threat that Matrix still has to face. It's also the threat Email is currently facing. On the one hand corporations like Facebook want to completely destroy it and have mostly succeeded: many people just have an email account to register on things and talk to their friends over Instagram or (lately) TikTok (which, I know, is not Facebook, but they started that fire). On the other hand, you have corporations like Microsoft and Google who are still using and providing email services because, frankly, you still do need email for stuff, just like fax is still around but they are more and more isolated in their own silo. At this point, it's only a matter of time they reach critical mass and just decide that the risk of allowing external mail coming in is not worth the cost. They'll simply flip the switch and work on an allow-list principle. Then we'll have closed the loop and email will be dead, just like IRC is "dead" now. I wonder which path Matrix will take. Could it liberate us from these vicious cycles? Update: this generated some discussions on lobste.rs.

  1. According to Wikipedia, there are currently about 500 distinct IRC networks operating, on about 1,000 servers, serving over 250,000 users. In contrast, Mastodon seems to be around 5 million users, Matrix.org claimed at FOSDEM 2021 to have about 28 million globally visible accounts, and Signal lays claim to over 40 million souls. XMPP claims to have "millions" of users on the xmpp.org homepage but the FAQ says they don't actually know. On the proprietary silo side of the fence, this page says
    • Facebook: 2.9 billion users
    • WhatsApp: 2B
    • Instagram: 1.4B
    • TikTok: 1B
    • Snapchat: 500M
    • Pinterest: 480M
    • Twitter: 397M
    Notable omission from that list: Youtube, with its mind-boggling 2.6 billion users... Those are not the kind of numbers you just "need to convince a brother or sister" to grow the network...

26 May 2022

Sergio Talens-Oliag: New Blog Config

As promised, on this post I m going to explain how I ve configured this blog using hugo, asciidoctor and the papermod theme, how I publish it using nginx, how I ve integrated the remark42 comment system and how I ve automated its publication using gitea and json2file-go. It is a long post, but I hope that at least parts of it can be interesting for some, feel free to ignore it if that is not your case

Hugo Configuration

Theme settingsThe site is using the PaperMod theme and as I m using asciidoctor to publish my content I ve adjusted the settings to improve how things are shown with it. The current config.yml file is the one shown below (probably some of the settings are not required nor being used right now, but I m including the current file, so this post will have always the latest version of it):
config.yml
baseURL: https://blogops.mixinet.net/
title: Mixinet BlogOps
paginate: 5
theme: PaperMod
destination: public/
enableInlineShortcodes: true
enableRobotsTXT: true
buildDrafts: false
buildFuture: false
buildExpired: false
enableEmoji: true
pygmentsUseClasses: true
minify:
  disableXML: true
  minifyOutput: true
languages:
  en:
    languageName: "English"
    description: "Mixinet BlogOps - https://blogops.mixinet.net/"
    author: "Sergio Talens-Oliag"
    weight: 1
    title: Mixinet BlogOps
    homeInfoParams:
      Title: "Sergio Talens-Oliag Technical Blog"
      Content: >
        ![Mixinet BlogOps](/images/mixinet-blogops.png)
    taxonomies:
      category: categories
      tag: tags
      series: series
    menu:
      main:
        - name: Archive
          url: archives
          weight: 5
        - name: Categories
          url: categories/
          weight: 10
        - name: Tags
          url: tags/
          weight: 10
        - name: Search
          url: search/
          weight: 15
outputs:
  home:
    - HTML
    - RSS
    - JSON
params:
  env: production
  defaultTheme: light
  disableThemeToggle: false
  ShowShareButtons: true
  ShowReadingTime: true
  disableSpecial1stPost: true
  disableHLJS: true
  displayFullLangName: true
  ShowPostNavLinks: true
  ShowBreadCrumbs: true
  ShowCodeCopyButtons: true
  ShowRssButtonInSectionTermList: true
  ShowFullTextinRSS: true
  ShowToc: true
  TocOpen: false
  comments: true
  remark42SiteID: "blogops"
  remark42Url: "/remark42"
  profileMode:
    enabled: false
    title: Sergio Talens-Oliag Technical Blog
    imageUrl: "/images/mixinet-blogops.png"
    imageTitle: Mixinet BlogOps
    buttons:
      - name: Archives
        url: archives
      - name: Categories
        url: categories
      - name: Tags
        url: tags
  socialIcons:
    - name: CV
      url: "https://www.uv.es/~sto/cv/"
    - name: Debian
      url: "https://people.debian.org/~sto/"
    - name: GitHub
      url: "https://github.com/sto/"
    - name: GitLab
      url: "https://gitlab.com/stalens/"
    - name: Linkedin
      url: "https://www.linkedin.com/in/sergio-talens-oliag/"
    - name: RSS
      url: "index.xml"
  assets:
    disableHLJS: true
    favicon: "/favicon.ico"
    favicon16x16:  "/favicon-16x16.png"
    favicon32x32:  "/favicon-32x32.png"
    apple_touch_icon:  "/apple-touch-icon.png"
    safari_pinned_tab:  "/safari-pinned-tab.svg"
  fuseOpts:
    isCaseSensitive: false
    shouldSort: true
    location: 0
    distance: 1000
    threshold: 0.4
    minMatchCharLength: 0
    keys: ["title", "permalink", "summary", "content"]
markup:
  asciidocExt:
    attributes:  
    backend: html5s
    extensions: ['asciidoctor-html5s','asciidoctor-diagram']
    failureLevel: fatal
    noHeaderOrFooter: true
    preserveTOC: false
    safeMode: unsafe
    sectionNumbers: false
    trace: false
    verbose: false
    workingFolderCurrent: true
privacy:
  vimeo:
    disabled: false
    simple: true
  twitter:
    disabled: false
    enableDNT: true
    simple: true
  instagram:
    disabled: false
    simple: true
  youtube:
    disabled: false
    privacyEnhanced: true
services:
  instagram:
    disableInlineCSS: true
  twitter:
    disableInlineCSS: true
security:
  exec:
    allow:
      - '^asciidoctor$'
      - '^dart-sass-embedded$'
      - '^go$'
      - '^npx$'
      - '^postcss$'
Some notes about the settings:
  • disableHLJS and assets.disableHLJS are set to true; we plan to use rouge on adoc and the inclusion of the hljs assets adds styles that collide with the ones used by rouge.
  • ShowToc is set to true and the TocOpen setting is set to false to make the ToC appear collapsed initially. My plan was to use the asciidoctor ToC, but after trying I believe that the theme one looks nice and I don t need to adjust styles, although it has some issues with the html5s processor (the admonition titles use <h6> and they are shown on the ToC, which is weird), to fix it I ve copied the layouts/partial/toc.html to my site repository and replaced the range of headings to end at 5 instead of 6 (in fact 5 still seems a lot, but as I don t think I ll use that heading level on the posts it doesn t really matter).
  • params.profileMode values are adjusted, but for now I ve left it disabled setting params.profileMode.enabled to false and I ve set the homeInfoParams to show more or less the same content with the latest posts under it (I ve added some styles to my custom.css style sheet to center the text and image of the first post to match the look and feel of the profile).
  • On the asciidocExt section I ve adjusted the backend to use html5s, I ve added the asciidoctor-html5s and asciidoctor-diagram extensions to asciidoctor and adjusted the workingFolderCurrent to true to make asciidoctor-diagram work right (haven t tested it yet).

Theme customisationsTo write in asciidoctor using the html5s processor I ve added some files to the assets/css/extended directory:
  1. As said before, I ve added the file assets/css/extended/custom.css to make the homeInfoParams look like the profile page and I ve also changed a little bit some theme styles to make things look better with the html5s output:
    custom.css
    /* Fix first entry alignment to make it look like the profile */
    .first-entry   text-align: center;  
    .first-entry img   display: inline;  
    /**
     * Remove margin for .post-content code and reduce padding to make it look
     * better with the asciidoctor html5s output.
     **/
    .post-content code   margin: auto 0; padding: 4px;  
  2. I ve also added the file assets/css/extended/adoc.css with some styles taken from the asciidoctor-default.css, see this blog post about the original file; mine is the same after formatting it with css-beautify and editing it to use variables for the colors to support light and dark themes:
    adoc.css
    /* AsciiDoctor*/
    table  
        border-collapse: collapse;
        border-spacing: 0
     
    .admonitionblock>table  
        border-collapse: separate;
        border: 0;
        background: none;
        width: 100%
     
    .admonitionblock>table td.icon  
        text-align: center;
        width: 80px
     
    .admonitionblock>table td.icon img  
        max-width: none
     
    .admonitionblock>table td.icon .title  
        font-weight: bold;
        font-family: "Open Sans", "DejaVu Sans", sans-serif;
        text-transform: uppercase
     
    .admonitionblock>table td.content  
        padding-left: 1.125em;
        padding-right: 1.25em;
        border-left: 1px solid #ddddd8;
        color: var(--primary)
     
    .admonitionblock>table td.content>:last-child>:last-child  
        margin-bottom: 0
     
    .admonitionblock td.icon [class^="fa icon-"]  
        font-size: 2.5em;
        text-shadow: 1px 1px 2px var(--secondary);
        cursor: default
     
    .admonitionblock td.icon .icon-note::before  
        content: "\f05a";
        color: var(--icon-note-color)
     
    .admonitionblock td.icon .icon-tip::before  
        content: "\f0eb";
        color: var(--icon-tip-color)
     
    .admonitionblock td.icon .icon-warning::before  
        content: "\f071";
        color: var(--icon-warning-color)
     
    .admonitionblock td.icon .icon-caution::before  
        content: "\f06d";
        color: var(--icon-caution-color)
     
    .admonitionblock td.icon .icon-important::before  
        content: "\f06a";
        color: var(--icon-important-color)
     
    .conum[data-value]  
        display: inline-block;
        color: #fff !important;
        background-color: rgba(100, 100, 0, .8);
        -webkit-border-radius: 100px;
        border-radius: 100px;
        text-align: center;
        font-size: .75em;
        width: 1.67em;
        height: 1.67em;
        line-height: 1.67em;
        font-family: "Open Sans", "DejaVu Sans", sans-serif;
        font-style: normal;
        font-weight: bold
     
    .conum[data-value] *  
        color: #fff !important
     
    .conum[data-value]+b  
        display: none
     
    .conum[data-value]::after  
        content: attr(data-value)
     
    pre .conum[data-value]  
        position: relative;
        top: -.125em
     
    b.conum *  
        color: inherit !important
     
    .conum:not([data-value]):empty  
        display: none
     
  3. The previous file uses variables from a partial copy of the theme-vars.css file that changes the highlighted code background color and adds the color definitions used by the admonitions:
    theme-vars.css
    :root  
        /* Solarized base2 */
        /* --hljs-bg: rgb(238, 232, 213); */
        /* Solarized base3 */
        /* --hljs-bg: rgb(253, 246, 227); */
        /* Solarized base02 */
        --hljs-bg: rgb(7, 54, 66);
        /* Solarized base03 */
        /* --hljs-bg: rgb(0, 43, 54); */
        /* Default asciidoctor theme colors */
        --icon-note-color: #19407c;
        --icon-tip-color: var(--primary);
        --icon-warning-color: #bf6900;
        --icon-caution-color: #bf3400;
        --icon-important-color: #bf0000
     
    .dark  
        --hljs-bg: rgb(7, 54, 66);
        /* Asciidoctor theme colors with tint for dark background */
        --icon-note-color: #3e7bd7;
        --icon-tip-color: var(--primary);
        --icon-warning-color: #ff8d03;
        --icon-caution-color: #ff7847;
        --icon-important-color: #ff3030
     
  4. The previous styles use font-awesome, so I ve downloaded its resources for version 4.7.0 (the one used by asciidoctor) storing the font-awesome.css into on the assets/css/extended dir (that way it is merged with the rest of .css files) and copying the fonts to the static/assets/fonts/ dir (will be served directly):
    FA_BASE_URL="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0"
    curl "$FA_BASE_URL/css/font-awesome.css" \
      > assets/css/extended/font-awesome.css
    for f in FontAwesome.otf fontawesome-webfont.eot \
      fontawesome-webfont.svg fontawesome-webfont.ttf \
      fontawesome-webfont.woff fontawesome-webfont.woff2; do
        curl "$FA_BASE_URL/fonts/$f" > "static/assets/fonts/$f"
    done
  5. As already said the default highlighter is disabled (it provided a css compatible with rouge) so we need a css to do the highlight styling; as rouge provides a way to export them, I ve created the assets/css/extended/rouge.css file with the thankful_eyes theme:
    rougify style thankful_eyes > assets/css/extended/rouge.css
  6. To support the use of the html5s backend with admonitions I ve added a variation of the example found on this blog post to assets/js/adoc-admonitions.js:
    adoc-admonitions.js
    // replace the default admonitions block with a table that uses a format
    // similar to the standard asciidoctor ... as we are using fa-icons here there
    // is no need to add the icons: font entry on the document.
    window.addEventListener('load', function ()  
      const admonitions = document.getElementsByClassName('admonition-block')
      for (let i = admonitions.length - 1; i >= 0; i--)  
        const elm = admonitions[i]
        const type = elm.classList[1]
        const title = elm.getElementsByClassName('block-title')[0];
    	const label = title.getElementsByClassName('title-label')[0]
    		.innerHTML.slice(0, -1);
        elm.removeChild(elm.getElementsByClassName('block-title')[0]);
        const text = elm.innerHTML
        const parent = elm.parentNode
        const tempDiv = document.createElement('div')
        tempDiv.innerHTML =  <div class="admonitionblock $ type ">
        <table>
          <tbody>
            <tr>
              <td class="icon">
                <i class="fa icon-$ type " title="$ label "></i>
              </td>
              <td class="content">
                $ text 
              </td>
            </tr>
          </tbody>
        </table>
      </div> 
        const input = tempDiv.childNodes[0]
        parent.replaceChild(input, elm)
       
     )
    and enabled its minified use on the layouts/partials/extend_footer.html file adding the following lines to it:
     - $admonitions := slice (resources.Get "js/adoc-admonitions.js")
        resources.Concat "assets/js/adoc-admonitions.js"   minify   fingerprint  
    <script defer crossorigin="anonymous" src="  $admonitions.RelPermalink  "
      integrity="  $admonitions.Data.Integrity  "></script>

Remark42 configurationTo integrate Remark42 with the PaperMod theme I ve created the file layouts/partials/comments.html with the following content based on the remark42 documentation, including extra code to sync the dark/light setting with the one set on the site:
comments.html
<div id="remark42"></div>
<script>
  var remark_config =  
    host:   .Site.Params.remark42Url  ,
    site_id:   .Site.Params.remark42SiteID  ,
    url:   .Permalink  ,
    locale:   .Site.Language.Lang  
   ;
  (function(c)  
    /* Adjust the theme using the local-storage pref-theme if set */
    if (localStorage.getItem("pref-theme") === "dark")  
      remark_config.theme = "dark";
      else if (localStorage.getItem("pref-theme") === "light")  
      remark_config.theme = "light";
     
    /* Add remark42 widget */
    for(var i = 0; i < c.length; i++) 
      var d = document, s = d.createElement('script');
      s.src = remark_config.host + '/web/' + c[i] +'.js';
      s.defer = true;
      (d.head   d.body).appendChild(s);
     
   )(remark_config.components   ['embed']);
</script>
In development I use it with anonymous comments enabled, but to avoid SPAM the production site uses social logins (for now I ve only enabled Github & Google, if someone requests additional services I ll check them, but those were the easy ones for me initially). To support theme switching with remark42 I ve also added the following inside the layouts/partials/extend_footer.html file:
 - if (not site.Params.disableThemeToggle)  
<script>
/* Function to change theme when the toggle button is pressed */
document.getElementById("theme-toggle").addEventListener("click", () =>  
  if (typeof window.REMARK42 != "undefined")  
    if (document.body.className.includes('dark'))  
      window.REMARK42.changeTheme('light');
      else  
      window.REMARK42.changeTheme('dark');
     
   
 );
</script>
 - end  
With this code if the theme-toggle button is pressed we change the remark42 theme before the PaperMod one (that s needed here only, on page loads the remark42 theme is synced with the main one using the code from the layouts/partials/comments.html shown earlier).

Development setupTo preview the site on my laptop I m using docker-compose with the following configuration:
docker-compose.yaml
version: "2"
services:
  hugo:
    build:
      context: ./docker/hugo-adoc
      dockerfile: ./Dockerfile
    image: sto/hugo-adoc
    container_name: hugo-adoc-blogops
    restart: always
    volumes:
      - .:/documents
    command: server --bind 0.0.0.0 -D -F
    user: $ APP_UID :$ APP_GID 
  nginx:
    image: nginx:latest
    container_name: nginx-blogops
    restart: always
    volumes:
      - ./nginx/default.conf:/etc/nginx/conf.d/default.conf
    ports:
      -  1313:1313
  remark42:
    build:
      context: ./docker/remark42
      dockerfile: ./Dockerfile
    image: sto/remark42
    container_name: remark42-blogops
    restart: always
    env_file:
      - ./.env
      - ./remark42/env.dev
    volumes:
      - ./remark42/var.dev:/srv/var
To run it properly we have to create the .env file with the current user ID and GID on the variables APP_UID and APP_GID (if we don t do it the files can end up being owned by a user that is not the same as the one running the services):
$ echo "APP_UID=$(id -u)\nAPP_GID=$(id -g)" > .env
The Dockerfile used to generate the sto/hugo-adoc is:
Dockerfile
FROM asciidoctor/docker-asciidoctor:latest
RUN gem install --no-document asciidoctor-html5s &&\
 apk update && apk add --no-cache curl libc6-compat &&\
 repo_path="gohugoio/hugo" &&\
 api_url="https://api.github.com/repos/$repo_path/releases/latest" &&\
 download_url="$(\
  curl -sL "$api_url"  \
  sed -n "s/^.*download_url\": \"\\(.*.extended.*Linux-64bit.tar.gz\)\"/\1/p"\
 )" &&\
 curl -sL "$download_url" -o /tmp/hugo.tgz &&\
 tar xf /tmp/hugo.tgz hugo &&\
 install hugo /usr/bin/ &&\
 rm -f hugo /tmp/hugo.tgz &&\
 /usr/bin/hugo version &&\
 apk del curl && rm -rf /var/cache/apk/*
# Expose port for live server
EXPOSE 1313
ENTRYPOINT ["/usr/bin/hugo"]
CMD [""]
If you review it you will see that I m using the docker-asciidoctor image as the base; the idea is that this image has all I need to work with asciidoctor and to use hugo I only need to download the binary from their latest release at github (as we are using an image based on alpine we also need to install the libc6-compat package, but once that is done things are working fine for me so far). The image does not launch the server by default because I don t want it to; in fact I use the same docker-compose.yml file to publish the site in production simply calling the container without the arguments passed on the docker-compose.yml file (see later). When running the containers with docker-compose up (or docker compose up if you have the docker-compose-plugin package installed) we also launch a nginx container and the remark42 service so we can test everything together. The Dockerfile for the remark42 image is the original one with an updated version of the init.sh script:
Dockerfile
FROM umputun/remark42:latest
COPY init.sh /init.sh
The updated init.sh is similar to the original, but allows us to use an APP_GID variable and updates the /etc/group file of the container so the files get the right user and group (with the original script the group is always 1001):
init.sh
#!/sbin/dinit /bin/sh
uid="$(id -u)"
if [ "$ uid " -eq "0" ]; then
  echo "init container"
  # set container's time zone
  cp "/usr/share/zoneinfo/$ TIME_ZONE " /etc/localtime
  echo "$ TIME_ZONE " >/etc/timezone
  echo "set timezone $ TIME_ZONE  ($(date))"
  # set UID & GID for the app
  if [ "$ APP_UID " ]   [ "$ APP_GID " ]; then
    [ "$ APP_UID " ]   APP_UID="1001"
    [ "$ APP_GID " ]   APP_GID="$ APP_UID "
    echo "set custom APP_UID=$ APP_UID  & APP_GID=$ APP_GID "
    sed -i "s/^app:x:1001:1001:/app:x:$ APP_UID :$ APP_GID :/" /etc/passwd
    sed -i "s/^app:x:1001:/app:x:$ APP_GID :/" /etc/group
  else
    echo "custom APP_UID and/or APP_GID not defined, using 1001:1001"
  fi
  chown -R app:app /srv /home/app
fi
echo "prepare environment"
# replace  % REMARK_URL %  by content of REMARK_URL variable
find /srv -regex '.*\.\(html\ js\ mjs\)$' -print \
  -exec sed -i "s % REMARK_URL % $ REMARK_URL  g"   \;
if [ -n "$ SITE_ID " ]; then
  #replace "site_id: 'remark'" by SITE_ID
  sed -i "s 'remark' '$ SITE_ID ' g" /srv/web/*.html
fi
echo "execute \"$*\""
if [ "$ uid " -eq "0" ]; then
  exec su-exec app "$@"
else
  exec "$@"
fi
The environment file used with remark42 for development is quite minimal:
env.dev
TIME_ZONE=Europe/Madrid
REMARK_URL=http://localhost:1313/remark42
SITE=blogops
SECRET=123456
ADMIN_SHARED_ID=sto
AUTH_ANON=true
EMOJI=true
And the nginx/default.conf file used to publish the service locally is simple too:
default.conf
server   
 listen 1313;
 server_name localhost;
 location /  
    proxy_pass http://hugo:1313;
    proxy_set_header Host $http_host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
  
 location /remark42/  
    rewrite /remark42/(.*) /$1 break;
    proxy_pass http://remark42:8080/;
    proxy_set_header Host $http_host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
   
 

Production setupThe VM where I m publishing the blog runs Debian GNU/Linux and uses binaries from local packages and applications packaged inside containers. To run the containers I m using docker-ce (I could have used podman instead, but I already had it installed on the machine, so I stayed with it). The binaries used on this project are included on the following packages from the main Debian repository:
  • git to clone & pull the repository,
  • jq to parse json files from shell scripts,
  • json2file-go to save the webhook messages to files,
  • inotify-tools to detect when new files are stored by json2file-go and launch scripts to process them,
  • nginx to publish the site using HTTPS and work as proxy for json2file-go and remark42 (I run it using a container),
  • task-spool to queue the scripts that update the deployment.
And I m using docker and docker compose from the debian packages on the docker repository:
  • docker-ce to run the containers,
  • docker-compose-plugin to run docker compose (it is a plugin, so no - in the name).

Repository checkoutTo manage the git repository I ve created a deploy key, added it to gitea and cloned the project on the /srv/blogops PATH (that route is owned by a regular user that has permissions to run docker, as I said before).

Compiling the site with hugoTo compile the site we are using the docker-compose.yml file seen before, to be able to run it first we build the container images and once we have them we launch hugo using docker compose run:
$ cd /srv/blogops
$ git pull
$ docker compose build
$ if [ -d "./public" ]; then rm -rf ./public; fi
$ docker compose run hugo --
The compilation leaves the static HTML on /srv/blogops/public (we remove the directory first because hugo does not clean the destination folder as jekyll does). The deploy script re-generates the site as described and moves the public directory to its final place for publishing.

Running remark42 with dockerOn the /srv/blogops/remark42 folder I have the following docker-compose.yml:
docker-compose.yml
version: "2"
services:
  remark42:
    build:
      context: ../docker/remark42
      dockerfile: ./Dockerfile
    image: sto/remark42
    env_file:
      - ../.env
      - ./env.prod
    container_name: remark42
    restart: always
    volumes:
      - ./var.prod:/srv/var
    ports:
      - 127.0.0.1:8042:8080
The ../.env file is loaded to get the APP_UID and APP_GID variables that are used by my version of the init.sh script to adjust file permissions and the env.prod file contains the rest of the settings for remark42, including the social network tokens (see the remark42 documentation for the available parameters, I don t include my configuration here because some of them are secrets).

Nginx configurationThe nginx configuration for the blogops.mixinet.net site is as simple as:
server  
  listen 443 ssl http2;
  server_name blogops.mixinet.net;
  ssl_certificate /etc/letsencrypt/live/blogops.mixinet.net/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/blogops.mixinet.net/privkey.pem;
  include /etc/letsencrypt/options-ssl-nginx.conf;
  ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem;
  access_log /var/log/nginx/blogops.mixinet.net-443.access.log;
  error_log  /var/log/nginx/blogops.mixinet.net-443.error.log;
  root /srv/blogops/nginx/public_html;
  location /  
    try_files $uri $uri/ =404;
   
  include /srv/blogops/nginx/remark42.conf;
 
server  
  listen 80 ;
  listen [::]:80 ;
  server_name blogops.mixinet.net;
  access_log /var/log/nginx/blogops.mixinet.net-80.access.log;
  error_log  /var/log/nginx/blogops.mixinet.net-80.error.log;
  if ($host = blogops.mixinet.net)  
    return 301 https://$host$request_uri;
   
  return 404;
 
On this configuration the certificates are managed by certbot and the server root directory is on /srv/blogops/nginx/public_html and not on /srv/blogops/public; the reason for that is that I want to be able to compile without affecting the running site, the deployment script generates the site on /srv/blogops/public and if all works well we rename folders to do the switch, making the change feel almost atomic.

json2file-go configurationAs I have a working WireGuard VPN between the machine running gitea at my home and the VM where the blog is served, I m going to configure the json2file-go to listen for connections on a high port using a self signed certificate and listening on IP addresses only reachable through the VPN. To do it we create a systemd socket to run json2file-go and adjust its configuration to listen on a private IP (we use the FreeBind option on its definition to be able to launch the service even when the IP is not available, that is, when the VPN is down). The following script can be used to set up the json2file-go configuration:
setup-json2file.sh
#!/bin/sh
set -e
# ---------
# VARIABLES
# ---------
BASE_DIR="/srv/blogops/webhook"
J2F_DIR="$BASE_DIR/json2file"
TLS_DIR="$BASE_DIR/tls"
J2F_SERVICE_NAME="json2file-go"
J2F_SERVICE_DIR="/etc/systemd/system/json2file-go.service.d"
J2F_SERVICE_OVERRIDE="$J2F_SERVICE_DIR/override.conf"
J2F_SOCKET_DIR="/etc/systemd/system/json2file-go.socket.d"
J2F_SOCKET_OVERRIDE="$J2F_SOCKET_DIR/override.conf"
J2F_BASEDIR_FILE="/etc/json2file-go/basedir"
J2F_DIRLIST_FILE="/etc/json2file-go/dirlist"
J2F_CRT_FILE="/etc/json2file-go/certfile"
J2F_KEY_FILE="/etc/json2file-go/keyfile"
J2F_CRT_PATH="$TLS_DIR/crt.pem"
J2F_KEY_PATH="$TLS_DIR/key.pem"
# ----
# MAIN
# ----
# Install packages used with json2file for the blogops site
sudo apt update
sudo apt install -y json2file-go uuid
if [ -z "$(type mkcert)" ]; then
  sudo apt install -y mkcert
fi
sudo apt clean
# Configuration file values
J2F_USER="$(id -u)"
J2F_GROUP="$(id -g)"
J2F_DIRLIST="blogops:$(uuid)"
J2F_LISTEN_STREAM="172.31.31.1:4443"
# Configure json2file
[ -d "$J2F_DIR" ]   mkdir "$J2F_DIR"
sudo sh -c "echo '$J2F_DIR' >'$J2F_BASEDIR_FILE'"
[ -d "$TLS_DIR" ]   mkdir "$TLS_DIR"
if [ ! -f "$J2F_CRT_PATH" ]   [ ! -f "$J2F_KEY_PATH" ]; then
  mkcert -cert-file "$J2F_CRT_PATH" -key-file "$J2F_KEY_PATH" "$(hostname -f)"
fi
sudo sh -c "echo '$J2F_CRT_PATH' >'$J2F_CRT_FILE'"
sudo sh -c "echo '$J2F_KEY_PATH' >'$J2F_KEY_FILE'"
sudo sh -c "cat >'$J2F_DIRLIST_FILE'" <<EOF
$(echo "$J2F_DIRLIST"   tr ';' '\n')
EOF
# Service override
[ -d "$J2F_SERVICE_DIR" ]   sudo mkdir "$J2F_SERVICE_DIR"
sudo sh -c "cat >'$J2F_SERVICE_OVERRIDE'" <<EOF
[Service]
User=$J2F_USER
Group=$J2F_GROUP
EOF
# Socket override
[ -d "$J2F_SOCKET_DIR" ]   sudo mkdir "$J2F_SOCKET_DIR"
sudo sh -c "cat >'$J2F_SOCKET_OVERRIDE'" <<EOF
[Socket]
# Set FreeBind to listen on missing addresses (the VPN can be down sometimes)
FreeBind=true
# Set ListenStream to nothing to clear its value and add the new value later
ListenStream=
ListenStream=$J2F_LISTEN_STREAM
EOF
# Restart and enable service
sudo systemctl daemon-reload
sudo systemctl stop "$J2F_SERVICE_NAME"
sudo systemctl start "$J2F_SERVICE_NAME"
sudo systemctl enable "$J2F_SERVICE_NAME"
# ----
# vim: ts=2:sw=2:et:ai:sts=2
Warning: The script uses mkcert to create the temporary certificates, to install the package on bullseye the backports repository must be available.

Gitea configurationTo make gitea use our json2file-go server we go to the project and enter into the hooks/gitea/new page, once there we create a new webhook of type gitea and set the target URL to https://172.31.31.1:4443/blogops and on the secret field we put the token generated with uuid by the setup script:
sed -n -e 's/blogops://p' /etc/json2file-go/dirlist
The rest of the settings can be left as they are:
  • Trigger on: Push events
  • Branch filter: *
Warning: We are using an internal IP and a self signed certificate, that means that we have to review that the webhook section of the app.ini of our gitea server allows us to call the IP and skips the TLS verification (you can see the available options on the gitea documentation). The [webhook] section of my server looks like this:
[webhook]
ALLOWED_HOST_LIST=private
SKIP_TLS_VERIFY=true
Once we have the webhook configured we can try it and if it works our json2file server will store the file on the /srv/blogops/webhook/json2file/blogops/ folder.

The json2file spooler scriptWith the previous configuration our system is ready to receive webhook calls from gitea and store the messages on files, but we have to do something to process those files once they are saved in our machine. An option could be to use a cronjob to look for new files, but we can do better on Linux using inotify we will use the inotifywait command from inotify-tools to watch the json2file output directory and execute a script each time a new file is moved inside it or closed after writing (IN_CLOSE_WRITE and IN_MOVED_TO events). To avoid concurrency problems we are going to use task-spooler to launch the scripts that process the webhooks using a queue of length 1, so they are executed one by one in a FIFO queue. The spooler script is this:
blogops-spooler.sh
#!/bin/sh
set -e
# ---------
# VARIABLES
# ---------
BASE_DIR="/srv/blogops/webhook"
BIN_DIR="$BASE_DIR/bin"
TSP_DIR="$BASE_DIR/tsp"
WEBHOOK_COMMAND="$BIN_DIR/blogops-webhook.sh"
# ---------
# FUNCTIONS
# ---------
queue_job()  
  echo "Queuing job to process file '$1'"
  TMPDIR="$TSP_DIR" TS_SLOTS="1" TS_MAXFINISHED="10" \
    tsp -n "$WEBHOOK_COMMAND" "$1"
 
# ----
# MAIN
# ----
INPUT_DIR="$1"
if [ ! -d "$INPUT_DIR" ]; then
  echo "Input directory '$INPUT_DIR' does not exist, aborting!"
  exit 1
fi
[ -d "$TSP_DIR" ]   mkdir "$TSP_DIR"
echo "Processing existing files under '$INPUT_DIR'"
find "$INPUT_DIR" -type f   sort   while read -r _filename; do
  queue_job "$_filename"
done
# Use inotifywatch to process new files
echo "Watching for new files under '$INPUT_DIR'"
inotifywait -q -m -e close_write,moved_to --format "%w%f" -r "$INPUT_DIR"  
  while read -r _filename; do
    queue_job "$_filename"
  done
# ----
# vim: ts=2:sw=2:et:ai:sts=2
To run it as a daemon we install it as a systemd service using the following script:
setup-spooler.sh
#!/bin/sh
set -e
# ---------
# VARIABLES
# ---------
BASE_DIR="/srv/blogops/webhook"
BIN_DIR="$BASE_DIR/bin"
J2F_DIR="$BASE_DIR/json2file"
SPOOLER_COMMAND="$BIN_DIR/blogops-spooler.sh '$J2F_DIR'"
SPOOLER_SERVICE_NAME="blogops-j2f-spooler"
SPOOLER_SERVICE_FILE="/etc/systemd/system/$SPOOLER_SERVICE_NAME.service"
# Configuration file values
J2F_USER="$(id -u)"
J2F_GROUP="$(id -g)"
# ----
# MAIN
# ----
# Install packages used with the webhook processor
sudo apt update
sudo apt install -y inotify-tools jq task-spooler
sudo apt clean
# Configure process service
sudo sh -c "cat > $SPOOLER_SERVICE_FILE" <<EOF
[Install]
WantedBy=multi-user.target
[Unit]
Description=json2file processor for $J2F_USER
After=docker.service
[Service]
Type=simple
User=$J2F_USER
Group=$J2F_GROUP
ExecStart=$SPOOLER_COMMAND
EOF
# Restart and enable service
sudo systemctl daemon-reload
sudo systemctl stop "$SPOOLER_SERVICE_NAME"   true
sudo systemctl start "$SPOOLER_SERVICE_NAME"
sudo systemctl enable "$SPOOLER_SERVICE_NAME"
# ----
# vim: ts=2:sw=2:et:ai:sts=2

The gitea webhook processorFinally, the script that processes the JSON files does the following:
  1. First, it checks if the repository and branch are right,
  2. Then, it fetches and checks out the commit referenced on the JSON file,
  3. Once the files are updated, compiles the site using hugo with docker compose,
  4. If the compilation succeeds the script renames directories to swap the old version of the site by the new one.
If there is a failure the script aborts but before doing it or if the swap succeeded the system sends an email to the configured address and/or the user that pushed updates to the repository with a log of what happened. The current script is this one:
blogops-webhook.sh
#!/bin/sh
set -e
# ---------
# VARIABLES
# ---------
# Values
REPO_REF="refs/heads/main"
REPO_CLONE_URL="https://gitea.mixinet.net/mixinet/blogops.git"
MAIL_PREFIX="[BLOGOPS-WEBHOOK] "
# Address that gets all messages, leave it empty if not wanted
MAIL_TO_ADDR="blogops@mixinet.net"
# If the following variable is set to 'true' the pusher gets mail on failures
MAIL_ERRFILE="false"
# If the following variable is set to 'true' the pusher gets mail on success
MAIL_LOGFILE="false"
# gitea's conf/app.ini value of NO_REPLY_ADDRESS, it is used for email domains
# when the KeepEmailPrivate option is enabled for a user
NO_REPLY_ADDRESS="noreply.example.org"
# Directories
BASE_DIR="/srv/blogops"
PUBLIC_DIR="$BASE_DIR/public"
NGINX_BASE_DIR="$BASE_DIR/nginx"
PUBLIC_HTML_DIR="$NGINX_BASE_DIR/public_html"
WEBHOOK_BASE_DIR="$BASE_DIR/webhook"
WEBHOOK_SPOOL_DIR="$WEBHOOK_BASE_DIR/spool"
WEBHOOK_ACCEPTED="$WEBHOOK_SPOOL_DIR/accepted"
WEBHOOK_DEPLOYED="$WEBHOOK_SPOOL_DIR/deployed"
WEBHOOK_REJECTED="$WEBHOOK_SPOOL_DIR/rejected"
WEBHOOK_TROUBLED="$WEBHOOK_SPOOL_DIR/troubled"
WEBHOOK_LOG_DIR="$WEBHOOK_SPOOL_DIR/log"
# Files
TODAY="$(date +%Y%m%d)"
OUTPUT_BASENAME="$(date +%Y%m%d-%H%M%S.%N)"
WEBHOOK_LOGFILE_PATH="$WEBHOOK_LOG_DIR/$OUTPUT_BASENAME.log"
WEBHOOK_ACCEPTED_JSON="$WEBHOOK_ACCEPTED/$OUTPUT_BASENAME.json"
WEBHOOK_ACCEPTED_LOGF="$WEBHOOK_ACCEPTED/$OUTPUT_BASENAME.log"
WEBHOOK_REJECTED_TODAY="$WEBHOOK_REJECTED/$TODAY"
WEBHOOK_REJECTED_JSON="$WEBHOOK_REJECTED_TODAY/$OUTPUT_BASENAME.json"
WEBHOOK_REJECTED_LOGF="$WEBHOOK_REJECTED_TODAY/$OUTPUT_BASENAME.log"
WEBHOOK_DEPLOYED_TODAY="$WEBHOOK_DEPLOYED/$TODAY"
WEBHOOK_DEPLOYED_JSON="$WEBHOOK_DEPLOYED_TODAY/$OUTPUT_BASENAME.json"
WEBHOOK_DEPLOYED_LOGF="$WEBHOOK_DEPLOYED_TODAY/$OUTPUT_BASENAME.log"
WEBHOOK_TROUBLED_TODAY="$WEBHOOK_TROUBLED/$TODAY"
WEBHOOK_TROUBLED_JSON="$WEBHOOK_TROUBLED_TODAY/$OUTPUT_BASENAME.json"
WEBHOOK_TROUBLED_LOGF="$WEBHOOK_TROUBLED_TODAY/$OUTPUT_BASENAME.log"
# Query to get variables from a gitea webhook json
ENV_VARS_QUERY="$(
  printf "%s" \
    '(.             @sh "gt_ref=\(.ref);"),' \
    '(.             @sh "gt_after=\(.after);"),' \
    '(.repository   @sh "gt_repo_clone_url=\(.clone_url);"),' \
    '(.repository   @sh "gt_repo_name=\(.name);"),' \
    '(.pusher       @sh "gt_pusher_full_name=\(.full_name);"),' \
    '(.pusher       @sh "gt_pusher_email=\(.email);")'
)"
# ---------
# Functions
# ---------
webhook_log()  
  echo "$(date -R) $*" >>"$WEBHOOK_LOGFILE_PATH"
 
webhook_check_directories()  
  for _d in "$WEBHOOK_SPOOL_DIR" "$WEBHOOK_ACCEPTED" "$WEBHOOK_DEPLOYED" \
    "$WEBHOOK_REJECTED" "$WEBHOOK_TROUBLED" "$WEBHOOK_LOG_DIR"; do
    [ -d "$_d" ]   mkdir "$_d"
  done
 
webhook_clean_directories()  
  # Try to remove empty dirs
  for _d in "$WEBHOOK_ACCEPTED" "$WEBHOOK_DEPLOYED" "$WEBHOOK_REJECTED" \
    "$WEBHOOK_TROUBLED" "$WEBHOOK_LOG_DIR" "$WEBHOOK_SPOOL_DIR"; do
    if [ -d "$_d" ]; then
      rmdir "$_d" 2>/dev/null   true
    fi
  done
 
webhook_accept()  
  webhook_log "Accepted: $*"
  mv "$WEBHOOK_JSON_INPUT_FILE" "$WEBHOOK_ACCEPTED_JSON"
  mv "$WEBHOOK_LOGFILE_PATH" "$WEBHOOK_ACCEPTED_LOGF"
  WEBHOOK_LOGFILE_PATH="$WEBHOOK_ACCEPTED_LOGF"
 
webhook_reject()  
  [ -d "$WEBHOOK_REJECTED_TODAY" ]   mkdir "$WEBHOOK_REJECTED_TODAY"
  webhook_log "Rejected: $*"
  if [ -f "$WEBHOOK_JSON_INPUT_FILE" ]; then
    mv "$WEBHOOK_JSON_INPUT_FILE" "$WEBHOOK_REJECTED_JSON"
  fi
  mv "$WEBHOOK_LOGFILE_PATH" "$WEBHOOK_REJECTED_LOGF"
  exit 0
 
webhook_deployed()  
  [ -d "$WEBHOOK_DEPLOYED_TODAY" ]   mkdir "$WEBHOOK_DEPLOYED_TODAY"
  webhook_log "Deployed: $*"
  mv "$WEBHOOK_ACCEPTED_JSON" "$WEBHOOK_DEPLOYED_JSON"
  mv "$WEBHOOK_ACCEPTED_LOGF" "$WEBHOOK_DEPLOYED_LOGF"
  WEBHOOK_LOGFILE_PATH="$WEBHOOK_DEPLOYED_LOGF"
 
webhook_troubled()  
  [ -d "$WEBHOOK_TROUBLED_TODAY" ]   mkdir "$WEBHOOK_TROUBLED_TODAY"
  webhook_log "Troubled: $*"
  mv "$WEBHOOK_ACCEPTED_JSON" "$WEBHOOK_TROUBLED_JSON"
  mv "$WEBHOOK_ACCEPTED_LOGF" "$WEBHOOK_TROUBLED_LOGF"
  WEBHOOK_LOGFILE_PATH="$WEBHOOK_TROUBLED_LOGF"
 
print_mailto()  
  _addr="$1"
  _user_email=""
  # Add the pusher email address unless it is from the domain NO_REPLY_ADDRESS,
  # which should match the value of that variable on the gitea 'app.ini' (it
  # is the domain used for emails when the user hides it).
  # shellcheck disable=SC2154
  if [ -n "$ gt_pusher_email##*@"$ NO_REPLY_ADDRESS " " ] &&
    [ -z "$ gt_pusher_email##*@* " ]; then
    _user_email="\"$gt_pusher_full_name <$gt_pusher_email>\""
  fi
  if [ "$_addr" ] && [ "$_user_email" ]; then
    echo "$_addr,$_user_email"
  elif [ "$_user_email" ]; then
    echo "$_user_email"
  elif [ "$_addr" ]; then
    echo "$_addr"
  fi
 
mail_success()  
  to_addr="$MAIL_TO_ADDR"
  if [ "$MAIL_LOGFILE" = "true" ]; then
    to_addr="$(print_mailto "$to_addr")"
  fi
  if [ "$to_addr" ]; then
    # shellcheck disable=SC2154
    subject="OK - $gt_repo_name updated to commit '$gt_after'"
    mail -s "$ MAIL_PREFIX $ subject " "$to_addr" \
      <"$WEBHOOK_LOGFILE_PATH"
  fi
 
mail_failure()  
  to_addr="$MAIL_TO_ADDR"
  if [ "$MAIL_ERRFILE" = true ]; then
    to_addr="$(print_mailto "$to_addr")"
  fi
  if [ "$to_addr" ]; then
    # shellcheck disable=SC2154
    subject="KO - $gt_repo_name update FAILED for commit '$gt_after'"
    mail -s "$ MAIL_PREFIX $ subject " "$to_addr" \
      <"$WEBHOOK_LOGFILE_PATH"
  fi
 
# ----
# MAIN
# ----
# Check directories
webhook_check_directories
# Go to the base directory
cd "$BASE_DIR"
# Check if the file exists
WEBHOOK_JSON_INPUT_FILE="$1"
if [ ! -f "$WEBHOOK_JSON_INPUT_FILE" ]; then
  webhook_reject "Input arg '$1' is not a file, aborting"
fi
# Parse the file
webhook_log "Processing file '$WEBHOOK_JSON_INPUT_FILE'"
eval "$(jq -r "$ENV_VARS_QUERY" "$WEBHOOK_JSON_INPUT_FILE")"
# Check that the repository clone url is right
# shellcheck disable=SC2154
if [ "$gt_repo_clone_url" != "$REPO_CLONE_URL" ]; then
  webhook_reject "Wrong repository: '$gt_clone_url'"
fi
# Check that the branch is the right one
# shellcheck disable=SC2154
if [ "$gt_ref" != "$REPO_REF" ]; then
  webhook_reject "Wrong repository ref: '$gt_ref'"
fi
# Accept the file
# shellcheck disable=SC2154
webhook_accept "Processing '$gt_repo_name'"
# Update the checkout
ret="0"
git fetch >>"$WEBHOOK_LOGFILE_PATH" 2>&1   ret="$?"
if [ "$ret" -ne "0" ]; then
  webhook_troubled "Repository fetch failed"
  mail_failure
fi
# shellcheck disable=SC2154
git checkout "$gt_after" >>"$WEBHOOK_LOGFILE_PATH" 2>&1   ret="$?"
if [ "$ret" -ne "0" ]; then
  webhook_troubled "Repository checkout failed"
  mail_failure
fi
# Remove the build dir if present
if [ -d "$PUBLIC_DIR" ]; then
  rm -rf "$PUBLIC_DIR"
fi
# Build site
docker compose run hugo -- >>"$WEBHOOK_LOGFILE_PATH" 2>&1   ret="$?"
# go back to the main branch
git switch main && git pull
# Fail if public dir was missing
if [ "$ret" -ne "0" ]   [ ! -d "$PUBLIC_DIR" ]; then
  webhook_troubled "Site build failed"
  mail_failure
fi
# Remove old public_html copies
webhook_log 'Removing old site versions, if present'
find $NGINX_BASE_DIR -mindepth 1 -maxdepth 1 -name 'public_html-*' -type d \
  -exec rm -rf   \; >>"$WEBHOOK_LOGFILE_PATH" 2>&1   ret="$?"
if [ "$ret" -ne "0" ]; then
  webhook_troubled "Removal of old site versions failed"
  mail_failure
fi
# Switch site directory
TS="$(date +%Y%m%d-%H%M%S)"
if [ -d "$PUBLIC_HTML_DIR" ]; then
  webhook_log "Moving '$PUBLIC_HTML_DIR' to '$PUBLIC_HTML_DIR-$TS'"
  mv "$PUBLIC_HTML_DIR" "$PUBLIC_HTML_DIR-$TS" >>"$WEBHOOK_LOGFILE_PATH" 2>&1  
    ret="$?"
fi
if [ "$ret" -eq "0" ]; then
  webhook_log "Moving '$PUBLIC_DIR' to '$PUBLIC_HTML_DIR'"
  mv "$PUBLIC_DIR" "$PUBLIC_HTML_DIR" >>"$WEBHOOK_LOGFILE_PATH" 2>&1  
    ret="$?"
fi
if [ "$ret" -ne "0" ]; then
  webhook_troubled "Site switch failed"
  mail_failure
else
  webhook_deployed "Site deployed successfully"
  mail_success
fi
# ----
# vim: ts=2:sw=2:et:ai:sts=2

18 May 2022

Reproducible Builds: Supporter spotlight: Jan Nieuwenhuizen on Bootstrappable Builds, GNU Mes and GNU Guix

The Reproducible Builds project relies on several projects, supporters and sponsors for financial support, but they are also valued as ambassadors who spread the word about our project and the work that we do. This is the fourth instalment in a series featuring the projects, companies and individuals who support the Reproducible Builds project. We started this series by featuring the Civil Infrastructure Platform project and followed this up with a post about the Ford Foundation as well as a recent ones about ARDC and the Google Open Source Security Team (GOSST). Today, however, we will be talking with Jan Nieuwenhuizen about Bootstrappable Builds, GNU Mes and GNU Guix.
Chris Lamb: Hi Jan, thanks for taking the time to talk with us today. First, could you briefly tell me about yourself? Jan: Thanks for the chat; it s been a while! Well, I ve always been trying to find something new and interesting that is just asking to be created but is mostly being overlooked. That s how I came to work on GNU Guix and create GNU Mes to address the bootstrapping problem that we have in free software. It s also why I have been working on releasing Dezyne, a programming language and set of tools to specify and formally verify concurrent software systems as free software. Briefly summarised, compilers are often written in the language they are compiling. This creates a chicken-and-egg problem which leads users and distributors to rely on opaque, pre-built binaries of those compilers that they use to build newer versions of the compiler. To gain trust in our computing platforms, we need to be able to tell how each part was produced from source, and opaque binaries are a threat to user security and user freedom since they are not auditable. The goal of bootstrappability (and the bootstrappable.org project in particular) is to minimise the amount of these bootstrap binaries. Anyway, after studying Physics at Eindhoven University of Technology (TU/e), I worked for digicash.com, a startup trying to create a digital and anonymous payment system sadly, however, a traditional account-based system won. Separate to this, as there was no software (either free or proprietary) to automatically create beautiful music notation, together with Han-Wen Nienhuys, I created GNU LilyPond. Ten years ago, I took the initiative to co-found a democratic school in Eindhoven based on the principles of sociocracy. And last Christmas I finally went vegan, after being mostly vegetarian for about 20 years!
Chris: For those who have not heard of it before, what is GNU Guix? What are the key differences between Guix and other Linux distributions? Jan: GNU Guix is both a package manager and a full-fledged GNU/Linux distribution. In both forms, it provides state-of-the-art package management features such as transactional upgrades and package roll-backs, hermetical-sealed build environments, unprivileged package management as well as per-user profiles. One obvious difference is that Guix forgoes the usual Filesystem Hierarchy Standard (ie. /usr, /lib, etc.), but there are other significant differences too, such as Guix being scriptable using Guile/Scheme, as well as Guix s dedication and focus on free software.
Chris: How does GNU Guix relate to GNU Mes? Or, rather, what problem is Mes attempting to solve? Jan: GNU Mes was created to address the security concerns that arise from bootstrapping an operating system such as Guix. Even if this process entirely involves free software (i.e. the source code is, at least, available), this commonly uses large and unauditable binary blobs. Mes is a Scheme interpreter written in a simple subset of C and a C compiler written in Scheme, and it comes with a small, bootstrappable C library. Twice, the Mes bootstrap has halved the size of opaque binaries that were needed to bootstrap GNU Guix. These reductions were achieved by first replacing GNU Binutils, GNU GCC and the GNU C Library with Mes, and then replacing Unix utilities such as awk, bash, coreutils, grep sed, etc., by Gash and Gash-Utils. The final goal of Mes is to help create a full-source bootstrap for any interested UNIX-like operating system.
Chris: What is the current status of Mes? Jan: Mes supports all that is needed from R5RS and GNU Guile to run MesCC with Nyacc, the C parser written for Guile, for 32-bit x86 and ARM. The next step for Mes would be more compatible with Guile, e.g., have guile-module support and support running Gash and Gash Utils. In working to create a full-source bootstrap, I have disregarded the kernel and Guix build system for now, but otherwise, all packages should be built from source, and obviously, no binary blobs should go in. We still need a Guile binary to execute some scripts, and it will take at least another one to two years to remove that binary. I m using the 80/20 approach, cutting corners initially to get something working and useful early. Another metric would be how many architectures we have. We are quite a way with ARM, tinycc now works, but there are still problems with GCC and Glibc. RISC-V is coming, too, which could be another metric. Someone has looked into picking up NixOS this summer. How many distros do anything about reproducibility or bootstrappability? The bootstrappability community is so small that we don t need metrics, sadly. The number of bytes of binary seed is a nice metric, but running the whole thing on a full-fledged Linux system is tough to put into a metric. Also, it is worth noting that I m developing on a modern Intel machine (ie. a platform with a management engine), that s another key component that doesn t have metrics.
Chris: From your perspective as a Mes/Guix user and developer, what does reproducibility mean to you? Are there any related projects? Jan: From my perspective, I m more into the problem of bootstrapping, and reproducibility is a prerequisite for bootstrappability. Reproducibility clearly takes a lot of effort to achieve, however. It s relatively easy to install some Linux distribution and be happy, but if you look at communities that really care about security, they are investing in reproducibility and other ways of improving the security of their supply chain. Projects I believe are complementary to Guix and Mes include NixOS, Debian and on the hardware side the RISC-V platform shares many of our core principles and goals.
Chris: Well, what are these principles and goals? Jan: Reproducibility and bootstrappability often feel like the next step in the frontier of free software. If you have all the sources and you can t reproduce a binary, that just doesn t feel right anymore. We should start to desire (and demand) transparent, elegant and auditable software stacks. To a certain extent, that s always been a low-level intent since the beginning of free software, but something clearly got lost along the way. On the other hand, if you look at the NPM or Rust ecosystems, we see a world where people directly install binaries. As they are not as supportive of copyleft as the rest of the free software community, you can see that movement and people in our area are doing more as a response to that so that what we have continues to remain free, and to prevent us from falling asleep and waking up in a couple of years and see, for example, Rust in the Linux kernel and (more importantly) we require big binary blobs to use our systems. It s an excellent time to advance right now, so we should get a foothold in and ensure we don t lose any more.
Chris: What would be your ultimate reproducibility goal? And what would the key steps or milestones be to reach that? Jan: The ultimate goal would be to have a system built with open hardware, with all software on it fully bootstrapped from its source. This bootstrap path should be easy to replicate and straightforward to inspect and audit. All fully reproducible, of course! In essence, we would have solved the supply chain security problem. Our biggest challenge is ignorance. There is much unawareness about the importance of what we are doing. As it is rather technical and doesn t really affect everyday computer use, that is not surprising. This unawareness can be a great force driving us in the opposite direction. Think of Rust being allowed in the Linux kernel, or Python being required to build a recent GNU C library (glibc). Also, the fact that companies like Google/Apple still want to play us vs them , not willing to to support GPL software. Not ready yet to truly support user freedom. Take the infamous log4j bug everyone is using open source these days, but nobody wants to take responsibility and help develop or nurture the community. Not ecosystem , as that s how it s being approached right now: live and let live/die: see what happens without taking any responsibility. We are growing and we are strong and we can do a lot but if we have to work against those powers, it can become problematic. So, let s spread our great message and get more people involved!
Chris: What has been your biggest win? Jan: From a technical point of view, the full-source bootstrap has have been our biggest win. A talk by Carl Dong at the 2019 Breaking Bitcoin conference stated that connecting Jeremiah Orian s Stage0 project to Mes would be the holy grail of bootstrapping, and we recently managed to achieve just that: in other words, starting from hex0, 357-byte binary, we can now build the entire Guix system. This past year we have not made significant visible progress, however, as our funding was unfortunately not there. The Stage0 project has advanced in RISC-V. A month ago, though, I secured NLnet funding for another year, and thanks to NLnet, Ekaitz Zarraga and Timothy Sample will work on GNU Mes and the Guix bootstrap as well. Separate to this, the bootstrappable community has grown a lot from two people it was six years ago: there are now currently over 100 people in the #bootstrappable IRC channel, for example. The enlarged community is possibly an even more important win going forward.
Chris: How realistic is a 100% bootstrappable toolchain? And from someone who has been working in this area for a while, is solving Trusting Trust) actually feasible in reality? Jan: Two answers: Yes and no, it really depends on your definition. One great thing is that the whole Stage0 project can also run on the Knight virtual machine, a hardware platform that was designed, I think, in the 1970s. I believe we can and must do better than we are doing today, and that there s a lot of value in it. The core issue is not the trust; we can probably all trust each other. On the other hand, we don t want to trust each other or even ourselves. I am not, personally, going to inspect my RISC-V laptop, and other people create the hardware and do probably not want to inspect the software. The answer comes back to being conscientious and doing what is right. Inserting GCC as a binary blob is not right. I think we can do better, and that s what I d like to do. The security angle is interesting, but I don t like the paranoid part of that; I like the beauty of what we are creating together and stepwise improving on that.
Chris: Thanks for taking the time to talk to us today. If someone wanted to get in touch or learn more about GNU Guix or Mes, where might someone go? Jan: Sure! First, check out: I m also on Twitter (@janneke_gnu) and on octodon.social (@janneke@octodon.social).
Chris: Thanks for taking the time to talk to us today. Jan: No problem. :)


For more information about the Reproducible Builds project, please see our website at reproducible-builds.org. If you are interested in ensuring the ongoing security of the software that underpins our civilisation and wish to sponsor the Reproducible Builds project, please reach out to the project by emailing contact@reproducible-builds.org.

16 May 2022

Matthew Garrett: Can we fix bearer tokens?

Last month I wrote about how bearer tokens are just awful, and a week later Github announced that someone had managed to exfiltrate bearer tokens from Heroku that gave them access to, well, a lot of Github repositories. This has inevitably resulted in a whole bunch of discussion about a number of things, but people seem to be largely ignoring the fundamental issue that maybe we just shouldn't have magical blobs that grant you access to basically everything even if you've copied them from a legitimate holder to Honest John's Totally Legitimate API Consumer.

To make it clearer what the problem is here, let's use an analogy. You have a safety deposit box. To gain access to it, you simply need to be able to open it with a key you were given. Anyone who turns up with the key can open the box and do whatever they want with the contents. Unfortunately, the key is extremely easy to copy - anyone who is able to get hold of your keyring for a moment is in a position to duplicate it, and then they have access to the box. Wouldn't it be better if something could be done to ensure that whoever showed up with a working key was someone who was actually authorised to have that key?

To achieve that we need some way to verify the identity of the person holding the key. In the physical world we have a range of ways to achieve this, from simply checking whether someone has a piece of ID that associates them with the safety deposit box all the way up to invasive biometric measurements that supposedly verify that they're definitely the same person. But computers don't have passports or fingerprints, so we need another way to identify them.

When you open a browser and try to connect to your bank, the bank's website provides a TLS certificate that lets your browser know that you're talking to your bank instead of someone pretending to be your bank. The spec allows this to be a bi-directional transaction - you can also prove your identity to the remote website. This is referred to as "mutual TLS", or mTLS, and a successful mTLS transaction ends up with both ends knowing who they're talking to, as long as they have a reason to trust the certificate they were presented with.

That's actually a pretty big constraint! We have a reasonable model for the server - it's something that's issued by a trusted third party and it's tied to the DNS name for the server in question. Clients don't tend to have stable DNS identity, and that makes the entire thing sort of awkward. But, thankfully, maybe we don't need to? We don't need the client to be able to prove its identity to arbitrary third party sites here - we just need the client to be able to prove it's a legitimate holder of whichever bearer token it's presenting to that site. And that's a much easier problem.

Here's the simple solution - clients generate a TLS cert. This can be self-signed, because all we want to do here is be able to verify whether the machine talking to us is the same one that had a token issued to it. The client contacts a service that's going to give it a bearer token. The service requests mTLS auth without being picky about the certificate that's presented. The service embeds a hash of that certificate in the token before handing it back to the client. Whenever the client presents that token to any other service, the service ensures that the mTLS cert the client presented matches the hash in the bearer token. Copy the token without copying the mTLS certificate and the token gets rejected. Hurrah hurrah hats for everyone.

Well except for the obvious problem that if you're in a position to exfiltrate the bearer tokens you can probably just steal the client certificates and keys as well, and now you can pretend to be the original client and this is not adding much additional security. Fortunately pretty much everything we care about has the ability to store the private half of an asymmetric key in hardware (TPMs on Linux and Windows systems, the Secure Enclave on Macs and iPhones, either a piece of magical hardware or Trustzone on Android) in a way that avoids anyone being able to just steal the key.

How do we know that the key is actually in hardware? Here's the fun bit - it doesn't matter. If you're issuing a bearer token to a system then you're already asserting that the system is trusted. If the system is lying to you about whether or not the key it's presenting is hardware-backed then you've already lost. If it lied and the system is later compromised then sure all your apes get stolen, but maybe don't run systems that lie and avoid that situation as a result?

Anyway. This is covered in RFC 8705 so why aren't we all doing this already? From the client side, the largest generic issue is that TPMs are astonishingly slow in comparison to doing a TLS handshake on the CPU. RSA signing operations on TPMs can take around half a second, which doesn't sound too bad, except your browser is probably establishing multiple TLS connections to subdomains on the site it's connecting to and performance is going to tank. Fixing this involves doing whatever's necessary to convince the browser to pipe everything over a single TLS connection, and that's just not really where the web is right at the moment. Using EC keys instead helps a lot (~0.1 seconds per signature on modern TPMs), but it's still going to be a bottleneck.

The other problem, of course, is that ecosystem support for hardware-backed certificates is just awful. Windows lets you stick them into the standard platform certificate store, but the docs for this are hidden in a random PDF in a Github repo. Macs require you to do some weird bridging between the Secure Enclave API and the keychain API. Linux? Well, the standard answer is to do PKCS#11, and I have literally never met anybody who likes PKCS#11 and I have spent a bunch of time in standards meetings with the sort of people you might expect to like PKCS#11 and even they don't like it. It turns out that loading a bunch of random C bullshit that has strong feelings about function pointers into your security critical process is not necessarily something that is going to improve your quality of life, so instead you should use something like this and just have enough C to bridge to a language that isn't secretly plotting to kill your pets the moment you turn your back.

And, uh, obviously none of this matters at all unless people actually support it. Github has no support at all for validating the identity of whoever holds a bearer token. Most issuers of bearer tokens have no support for embedding holder identity into the token. This is not good! As of last week, all three of the big cloud providers support virtualised TPMs in their VMs - we should be running CI on systems that can do that, and tying any issued tokens to the VMs that are supposed to be making use of them.

So sure this isn't trivial. But it's also not impossible, and making this stuff work would improve the security of, well, everything. We literally have the technology to prevent attacks like Github suffered. What do we have to do to get people to actually start working on implementing that?

comment count unavailable comments

17 April 2022

Matthew Garrett: The Freedom Phone is not great at privacy

The Freedom Phone advertises itself as a "Free speech and privacy first focused phone". As documented on the features page, it runs ClearOS, an Android-based OS produced by Clear United (or maybe one of the bewildering array of associated companies, we'll come back to that later). It's advertised as including Signal, but what's shipped is not the version available from the Signal website or any official app store - instead it's this fork called "ClearSignal".

The first thing to note about ClearSignal is that the privacy policy link from that page 404s, which is not a great start. The second thing is that it has a version number of 5.8.14, which is strange because upstream went from 5.8.10 to 5.9.0. The third is that, despite Signal being GPL 3, there's no source code available. So, I grabbed jadx and started looking for differences between ClearSignal and the upstream 5.8.10 release. The results were, uh, surprising.

First up is that they seem to have integrated ACRA, a crash reporting framework. This feels a little odd - in the absence of a privacy policy, it's unclear what information this gathers or how it'll be stored. Having a piece of privacy software automatically uploading information about what you were doing in the event of a crash with no notification other than a toast that appears saying "Crash Report" feels a little dubious.

Next is that Signal (for fairly obvious reasons) warns you if your version is out of date and eventually refuses to work unless you upgrade. ClearSignal has dealt with this problem by, uh, simply removing that code. The MacOS version of the desktop app they provide for download seems to be derived from a release from last September, which for an Electron-based app feels like a pretty terrible idea. Weirdly, for Windows they link to an official binary release from February 2021, and for Linux they tell you how to use the upstream repo properly. I have no idea what's going on here.

They've also added support for network backups of your Signal data. This involves the backups being pushed to an S3 bucket using credentials that are statically available in the app. It's ok, though, each upload has some sort of nominally unique identifier associated with it, so it's not trivial to just download other people's backups. But, uh, where does this identifier come from? It turns out that Clear Center, another of the Clear family of companies, employs a bunch of people to work on a ClearID[1], some sort of decentralised something or other that seems to be based on KERI. There's an overview slide deck here which didn't really answer any of my questions and as far as I can tell this is entirely lacking any sort of peer review, but hey it's only the one thing that stops anyone on the internet being able to grab your Signal backups so how important can it be.

The final thing, though? They've extended Signal's invitation support to encourage users to get others to sign up for Clear United. There's an exposed API endpoint called "get_user_email_by_mobile_number" which does exactly what you'd expect - if you give it a registered phone number, it gives you back the associated email address. This requires no authentication. But it gets better! The API to generate a referral link to send to others sends the name and phone number of everyone in your phone's contact list. There does not appear to be any indication that this is going to happen.

So, from a privacy perspective, going to go with things being some distance from ideal. But what's going on with all these Clear companies anyway? They all seem to be related to Michael Proper, who founded the Clear Foundation in 2009. They are, perhaps unsurprisingly, heavily invested in blockchain stuff, while Clear United also appears to be some sort of multi-level marketing scheme which has a membership agreement that includes the somewhat astonishing claim that:

Specifically, the initial focus of the Association will provide members with supplements and technologies for:

9a. Frequency Evaluation, Scans, Reports;

9b. Remote Frequency Health Tuning through Quantum Entanglement;

9c. General and Customized Frequency Optimizations;


- there's more discussion of this and other weirdness here. Clear Center, meanwhile, has a Chief Physics Officer? I have a lot of questions.

Anyway. We have a company that seems to be combining blockchain and MLM, has some opinions about Quantum Entanglement, bases the security of its platform on a set of novel cryptographic primitives that seem to have had no external review, has implemented an API that just hands out personal information without any authentication and an app that appears more than happy to upload all your contact details without telling you first, has failed to update this app to keep up with upstream security updates, and is violating the upstream license. If this is their idea of "privacy first", I really hate to think what their code looks like when privacy comes further down the list.

[1] Pointed out to me here

comment count unavailable comments

5 April 2022

Matthew Garrett: Bearer tokens are just awful

As I mentioned last time, bearer tokens are not super compatible with a model in which every access is verified to ensure it's coming from a trusted device. Let's talk about that in a bit more detail.

First off, what is a bearer token? In its simplest form, it's simply an opaque blob that you give to a user after an authentication or authorisation challenge, and then they show it to you to prove that they should be allowed access to a resource. In theory you could just hand someone a randomly generated blob, but then you'd need to keep track of which blobs you've issued and when they should be expired and who they correspond to, so frequently this is actually done using JWTs which contain some base64 encoded JSON that describes the user and group membership and so on and then have a signature associated with them so whenever the user presents one you can just validate the signature and then assume that the contents of the JSON are trustworthy.

One thing to note here is that the crypto is purely between whoever issued the token and whoever validates the token - as far as the server is concerned, any client who can just show it the token is just fine as long as the signature is verified. There's no way to verify the client's state, so one of the core ideas of Zero Trust (that we verify that the client is in a trustworthy state on every access) is already violated.

Can we make things not terrible? Sure! We may not be able to validate the client state on every access, but we can validate the client state when we issue the token in the first place. When the user hits a login page, we do state validation according to whatever policy we want to enforce, and if the client violates that policy we refuse to issue a token to it. If the token has a sufficiently short lifetime then an attacker is only going to have a short period of time to use that token before it expires and then (with luck) they won't be able to get a new one because the state validation will fail.

Except! This is fine for cases where we control the issuance flow. What if we have a scenario where a third party authenticates the client (by verifying that they have a valid token issued by their ID provider) and then uses that to issue their own token that's much longer lived? Well, now the client has a long-lived token sitting on it. And if anyone copies that token to another device, they can now pretend to be that client.

This is, sadly, depressingly common. A lot of services will verify the user, and then issue an oauth token that'll expire some time around the heat death of the universe. If a client system is compromised and an attacker just copies that token to another system, they can continue to pretend to be the legitimate user until someone notices (which, depending on whether or not the service in question has any sort of audit logs, and whether you're paying any attention to them, may be once screenshots of your data show up on Twitter).

This is a problem! There's no way to fit a hosted service that behaves this way into a Zero Trust model - the best you can say is that a token was issued to a device that was, around that time, apparently trustworthy, and now it's some time later and you have literally no idea whether the device is still trustworthy or if the token is still even on that device.

But wait, there's more! Even if you're nowhere near doing any sort of Zero Trust stuff, imagine the case of a user having a bunch of tokens from multiple services on their laptop, and then they leave their laptop unlocked in a cafe while they head to the toilet and whoops it's not there any more, better assume that someone has access to all the data on there. How many services has our opportunistic new laptop owner gained access to as a result? How do we revoke all of the tokens that are sitting there on the local disk? Do you even have a policy for dealing with that?

There isn't a simple answer to all of these problems. Replacing bearer tokens with some sort of asymmetric cryptographic challenge to the client would at least let us tie the tokens to a TPM or other secure enclave, and then we wouldn't have to worry about them being copied elsewhere. But that wouldn't help us if the client is compromised and the attacker simply keeps using the compromised client. The entire model of simply proving knowledge of a secret being sufficient to gain access to a resource is inherently incompatible with a desire for fine-grained trust verification on every access, but I don't see anything changing until we have a standard for third party services to be able to perform that trust verification against a customer's policy.

Still, at least this means I can just run weird Android IoT apps through mitmproxy, pull the bearer token out of the request headers and then start poking the remote API with curl. It may all be broken, but it's also got me a bunch of bug bounty credit, so, it;s impossible to say if its bad or not,

(Addendum: this suggestion that we solve the hardware binding problem by simply passing all the network traffic through some sort of local enclave that could see tokens being set and would then sequester them and reinject them into later requests is OBVIOUSLY HORRIFYING and is also probably going to be at least three startup pitches by the end of next week)

comment count unavailable comments

1 April 2022

Paul Wise: FLOSS Activities March 2022

Focus This month I didn't have any particular focus. I just worked on issues in my info bubble.

Changes

Issues

Review
  • Spam: reported 3 Debian bug reports and 53 Debian mailing list posts
  • Debian wiki: RecentChanges for the month
  • Debian BTS usertags: changes for the month
  • Debian screenshots:

Administration
  • Debian servers: investigate wiki mail delivery issue, restart backup director
  • Debian wiki: unblock IP addresses, approve accounts

Communication
  • Forward python-plac test failure issue upstream
  • Participate in Debian Project Leader election discussions
  • Respond to queries from Debian users and contributors on the mailing lists and IRC

Sponsors The oci-python-sdk and plac work was sponsored. All other work was done on a volunteer basis.

31 March 2022

Matthew Garrett: ZTA doesn't solve all problems, but partial implementations solve fewer

Traditional network access controls work by assuming that something is trustworthy based on some other factor - for example, if a computer is on your office network, it's trustworthy because only trustworthy people should be able to gain physical access to plug something in. If you restrict access to your services to requests coming from trusted networks, then you can assert that it's coming from a trusted device.

Of course, this isn't necessarily true. A machine on your office network may be compromised. An attacker may obtain valid VPN credentials. Someone could leave a hostile device plugged in under a desk in a meeting room. Trust is being placed in devices that may not be trustworthy.

A Zero Trust Architecture (ZTA) is one where a device is granted no inherent trust. Instead, each access to a service is validated against some policy - if the policy is satisfied, the access is permitted. A typical implementation involves granting each device some sort of cryptographic identity (typically a TLS client certificate) and placing the protected services behind a proxy. The proxy verifies the device identity, queries another service to obtain the current device state (we'll come back to that in a moment), compares the state against a policy and either pass the request through to the service or reject it. Different services can have different policies (eg, you probably want a lax policy around whatever's hosting the documentation for how to fix your system if it's being refused access to something for being in the wrong state), and if you want you can also tie it to proof of user identity in some way.

From a user perspective, this is entirely transparent. The proxy is made available on the public internet, DNS for the services points to the proxy, and every time your users try to access the service they hit the proxy instead and (if everything's ok) gain access to it no matter which network they're on. There's no need to connect to a VPN first, and there's no worries about accidentally leaking information over the public internet instead of over a secure link.

It's also notable that traditional solutions tend to be all-or-nothing. If I have some services that are more sensitive than others, the only way I can really enforce this is by having multiple different VPNs and only granting access to sensitive services from specific VPNs. This obviously risks combinatorial explosion once I have more than a couple of policies, and it's a terrible user experience.

Overall, ZTA approaches provide more security and an improved user experience. So why are we still using VPNs? Primarily because this is all extremely difficult. Let's take a look at an extremely recent scenario. A device used by customer support technicians was compromised. The vendor in question has a solution that can tie authentication decisions to whether or not a device has a cryptographic identity. If this was in use, and if the cryptographic identity was tied to the device hardware (eg, by being generated in a TPM), the attacker would not simply be able to obtain the user credentials and log in from their own device. This is good - if the attacker wanted to maintain access to the service, they needed to stay on the device in question. This increases the probability of the monitoring tooling on the compromised device noticing them.

Unfortunately, the attacker simply disabled the monitoring tooling on the compromised device. If device state was being verified on each access then this would be noticed before too long - the last data received from the device would be flagged as too old, and the requests would no longer satisfy any reasonable access control policy. Instead, the device was assumed to be trustworthy simply because it could demonstrate its identity. There's an important point here: just because a device belongs to you doesn't mean it's a trustworthy device.

So, if ZTA approaches are so powerful and user-friendly, why aren't we all using one? There's a few problems, but the single biggest is that there's no standardised way to verify device state in any meaningful way. Remote Attestation can both prove device identity and the device boot state, but the only product on the market that does much with this is Microsoft's Device Health Attestation. DHA doesn't solve the broader problem of also reporting runtime state - it may be able to verify that endpoint monitoring was launched, but it doesn't make assertions about whether it's still running. Right now, people are left trying to scrape this information from whatever tooling they're running. The absence of any standardised approach to this problem means anyone who wants to deploy a strong ZTA has to integrate with whatever tooling they're already running, and that then increases the cost of migrating to any other tooling later.

But even device identity is hard! Knowing whether a machine should be given a certificate or not depends on knowing whether or not you own it, and inventory control is a surprisingly difficult problem in a lot of environments. It's not even just a matter of whether a machine should be given a certificate in the first place - if a machine is reported as lost or stolen, its trust should be revoked. Your inventory system needs to tie into your device state store in order to ensure that your proxies drop access.

And, worse, all of this depends on you being able to put stuff behind a proxy in the first place! If you're using third-party hosted services, that's a problem. In the absence of a proxy, trust decisions are probably made at login time. It's possible to tie user auth decisions to device identity and state (eg, a self-hosted SAML endpoint could do that before passing through to the actual ID provider), but that's still going to end up providing a bearer token of some sort that can potentially be exfiltrated, and will continue to be trusted even if the device state becomes invalid.

ZTA doesn't solve all problems, and there isn't a clear path to it doing so without significantly greater industry support. But a complete ZTA solution is significantly more powerful than a partial one. Verifying device identity is a step on the path to ZTA, but in the absence of device state verification it's only a step.

comment count unavailable comments

23 March 2022

Matthew Garrett: AMD's Pluton implementation seems to be controllable

I've been digging through the firmware for an AMD laptop with a Ryzen 6000 that incorporates Pluton for the past couple of weeks, and I've got some rough conclusions. Note that these are extremely preliminary and may not be accurate, but I'm going to try to encourage others to look into this in more detail. For those of you at home, I'm using an image from here, specifically version 309. The installer is happy to run under Wine, and if you tell it to "Extract" rather than "Install" it'll leave a file sitting in C:\\DRIVERS\ASUS_GA402RK_309_BIOS_Update_20220322235241 which seems to have an additional 2K of header on it. Strip that and you should have something approximating a flash image.

Looking for UTF16 strings in this reveals something interesting:

Pluton (HSP) X86 Firmware Support
Enable/Disable X86 firmware HSP related code path, including AGESA HSP module, SBIOS HSP related drivers.
Auto - Depends on PcdAmdHspCoreEnable build value
NOTE: PSP directory entry 0xB BIT36 have the highest priority.
NOTE: This option will NOT put HSP hardware in disable state, to disable HSP hardware, you need setup PSP directory entry 0xB, BIT36 to 1.
// EntryValue[36] = 0: Enable, HSP core is enabled.
// EntryValue[36] = 1: Disable, HSP core is disabled then PSP will gate the HSP clock, no further PSP to HSP commands. System will boot without HSP.

"HSP" here means "Hardware Security Processor" - a generic term that refers to Pluton in this case. This is a configuration setting that determines whether Pluton is "enabled" or not - my interpretation of this is that it doesn't directly influence Pluton, but disables all mechanisms that would allow the OS to communicate with it. In this scenario, Pluton has its firmware loaded and could conceivably be functional if the OS knew how to speak to it directly, but the firmware will never speak to it itself. I took a quick look at the Windows drivers for Pluton and it looks like they won't do anything unless the firmware wants to expose Pluton, so this should mean that Windows will do nothing.

So what about the reference to "PSP directory entry 0xB BIT36 have the highest priority"? The PSP is the AMD Platform Security Processor - it's an ARM core on the CPU package that boots before the x86. The PSP firmware lives in the same flash image as the x86 firmware, so the PSP looks for a header that points it towards the firmware it should execute. This gives a pointer to a "directory" - a list of different object types and where they're located in flash (there's a description of this for slightly older AMDs here). Type 0xb is treated slightly specially. Where most types contain the address of where the actual object is, type 0xb contains a 64-bit value that's interpreted as enabling or disabling various features - something AMD calls "soft fusing" (Intel have something similar that involves setting bits in the Firmware Interface Table). The PSP looks at the bits that are set here and alters its behaviour. If bit 36 is set, the PSP tells Pluton to turn itself off and will no longer send any commands to it.

So, we have two mechanisms to disable Pluton - the PSP can tell it to turn itself off, or the x86 firmware can simply never speak to it or admit that it exists. Both of these imply that Pluton has started executing before it's shut down, so it's reasonable to wonder whether it can still do stuff. In the image I'm looking at, there's a blob starting at 0x0069b610 that appears to be firmware for Pluton - it contains chunks that appear to be the reference TPM2 implementation, and it broadly decompiles as valid ARM code. It should be viable to figure out whether it can do anything in the face of being "disabled" via either of the above mechanisms.

Unfortunately for me, the system I'm looking at does set bit 36 in the 0xb entry - as a result, Pluton is disabled before x86 code starts running and I can't investigate further in any straightforward way. The implication that the user-controllable mechanism for disabling Pluton merely disables x86 communication with it rather than turning it off entirely is a little concerning, although (assuming Pluton is behaving as a TPM rather than having an enhanced set of capabilities) skipping any firmware communication means the OS has no way to know what happened before it started running even if it has a mechanism to communicate with Pluton without firmware assistance. In that scenario it'd be viable to write a bootloader shim that just faked up the firmware measurements before handing control to the OS.

The bit 36 disabling mechanism seems more solid? Again, it should be possible to analyse the Pluton firmware to determine whether it actually pays attention to a disable command being sent. But even if it chooses to ignore that, if the PSP is in a position to just cut the clock to Pluton, it's not going to be able to do a lot. At that point we're trusting AMD rather than trusting Microsoft, but given that you're also trusting AMD to execute the code you're giving them to execute, it's hard to avoid placing trust in them.

Overall: I'm reasonably confident that systems that ship with Pluton disabled via setting bit 36 in the soft fuses are going to disable it sufficiently hard that the OS can't do anything about it. Systems that give the user an option to enable or disable it are a little less clear in that respect, and it's possible (but not yet demonstrated) that an OS could communicate with Pluton anyway. However, if that's true, and if the firmware never communicates with Pluton itself, the user could install a stub loader in UEFI that mimicks the firmware behaviour and leaves the OS thinking everything was good when it absolutely is not.

So, assuming that Pluton in its current form on AMD has no capabilities outside those we know about, the disabling mechanisms are probably good enough. It's tough to make a firm statement on this before I have access to a system that doesn't just disable it immediately, so stay tuned for updates.

comment count unavailable comments

1 March 2022

Paul Wise: FLOSS Activities February 2022

Focus This month I didn't have any particular focus. I just worked on issues in my info bubble.

Changes

Issues

Review

Administration
  • Debian BTS: fix usertags for some users, unarchive/reopen/triage bugs for reintroduced packages: horae dh-haskell shutter logging-tree openstreetmap-carto gnome-commander
  • Debian servers: restore wiki data from backup, restarted bacula director for TLS cert update
  • Debian wiki: restore wiki account from backup, approve accounts

Communication
  • Respond to queries from Debian users and contributors on the mailing lists and IRC

Sponsors The purple-discord, gensim, DiskANN, SPTAG, wsproto work was sponsored by my employer. All other work was done on a volunteer basis.

26 January 2022

Timo Jyrinki: Unboxing Dell XPS 13 - openSUSE Tumbleweed alongside preinstalled Ubuntu

A look at the 2021 model of Dell XPS 13 - available with Linux pre-installed
I received a new laptop for work - a Dell XPS 13. Dell has been long famous for offering certain models with pre-installed Linux as a supported option, and opting for those is nice for moving some euros/dollars from certain PC desktop OS monopoly towards Linux desktop engineering costs. Notably Lenovo also offers Ubuntu and Fedora options on many models these days (like Carbon X1 and P15 Gen 2).
black box

opened box

accessories and a leaflet about Linux support

laptop lifted from the box, closed

laptop with lid open

Ubuntu running

openSUSE runnin
Obviously a smooth, ready-to-rock Ubuntu installation is nice for most people already, but I need openSUSE, so after checking everything is fine with Ubuntu, I continued to install openSUSE Tumbleweed as a dual boot option. As I m a funny little tinkerer, I obviously went with some special things. I wanted:
  • Ubuntu to remain as the reference supported OS on a small(ish) partition, useful to compare to if trying out new development versions of software on openSUSE and finding oddities.
  • openSUSE as the OS consuming most of the space.
  • LUKS encryption for openSUSE without LVM.
  • ext4 s new fancy fast_commit feature in use during filesystem creation.
  • As a result of all that, I ended up juggling back and forth installation screens a couple of times (even more than shown below, and also because I forgot I wanted to use encryption the first time around).
First boots to pre-installed Ubuntu and installation of openSUSE Tumbleweed as the dual-boot option:
(if the embedded video is not shown, use a direct link)
Some notes from the openSUSE installation:
  • openSUSE installer s partition editor apparently does not support resizing or automatically installing side-by-side another Linux distribution, so I did part of the setup completely on my own.
  • Installation package download hanged a couple of times, only passed when I entered a mirror manually. On my TW I ve also noticed download problems recently, there might be a problem with some mirror I need to escalate.
  • The installer doesn t very clearly show encryption status of the target installation - it took me a couple of attempts before I even noticed the small encrypted column and icon (well, very small, see below), which also did not spell out the device mapper name but only the main partition name. In the end it was going to do the right thing right away and use my pre-created encrypted target partition as I wanted, but it could be a better UX. Then again I was doing my very own tweaks anyway.
  • Let s not go to the details why I m so old-fashioned and use ext4 :)
  • openSUSE s installer does not work fine with HiDPI screen. Funnily the tty consoles seem to be fine and with a big font.
  • At the end of the video I install the two GNOME extensions I can t live without, Dash to Dock and Sound Input & Output Device Chooser.

Next.