Search Results: "mjg59"

16 May 2022

Matthew Garrett: Can we fix bearer tokens?

Last month I wrote about how bearer tokens are just awful, and a week later Github announced that someone had managed to exfiltrate bearer tokens from Heroku that gave them access to, well, a lot of Github repositories. This has inevitably resulted in a whole bunch of discussion about a number of things, but people seem to be largely ignoring the fundamental issue that maybe we just shouldn't have magical blobs that grant you access to basically everything even if you've copied them from a legitimate holder to Honest John's Totally Legitimate API Consumer.

To make it clearer what the problem is here, let's use an analogy. You have a safety deposit box. To gain access to it, you simply need to be able to open it with a key you were given. Anyone who turns up with the key can open the box and do whatever they want with the contents. Unfortunately, the key is extremely easy to copy - anyone who is able to get hold of your keyring for a moment is in a position to duplicate it, and then they have access to the box. Wouldn't it be better if something could be done to ensure that whoever showed up with a working key was someone who was actually authorised to have that key?

To achieve that we need some way to verify the identity of the person holding the key. In the physical world we have a range of ways to achieve this, from simply checking whether someone has a piece of ID that associates them with the safety deposit box all the way up to invasive biometric measurements that supposedly verify that they're definitely the same person. But computers don't have passports or fingerprints, so we need another way to identify them.

When you open a browser and try to connect to your bank, the bank's website provides a TLS certificate that lets your browser know that you're talking to your bank instead of someone pretending to be your bank. The spec allows this to be a bi-directional transaction - you can also prove your identity to the remote website. This is referred to as "mutual TLS", or mTLS, and a successful mTLS transaction ends up with both ends knowing who they're talking to, as long as they have a reason to trust the certificate they were presented with.

That's actually a pretty big constraint! We have a reasonable model for the server - it's something that's issued by a trusted third party and it's tied to the DNS name for the server in question. Clients don't tend to have stable DNS identity, and that makes the entire thing sort of awkward. But, thankfully, maybe we don't need to? We don't need the client to be able to prove its identity to arbitrary third party sites here - we just need the client to be able to prove it's a legitimate holder of whichever bearer token it's presenting to that site. And that's a much easier problem.

Here's the simple solution - clients generate a TLS cert. This can be self-signed, because all we want to do here is be able to verify whether the machine talking to us is the same one that had a token issued to it. The client contacts a service that's going to give it a bearer token. The service requests mTLS auth without being picky about the certificate that's presented. The service embeds a hash of that certificate in the token before handing it back to the client. Whenever the client presents that token to any other service, the service ensures that the mTLS cert the client presented matches the hash in the bearer token. Copy the token without copying the mTLS certificate and the token gets rejected. Hurrah hurrah hats for everyone.

Well except for the obvious problem that if you're in a position to exfiltrate the bearer tokens you can probably just steal the client certificates and keys as well, and now you can pretend to be the original client and this is not adding much additional security. Fortunately pretty much everything we care about has the ability to store the private half of an asymmetric key in hardware (TPMs on Linux and Windows systems, the Secure Enclave on Macs and iPhones, either a piece of magical hardware or Trustzone on Android) in a way that avoids anyone being able to just steal the key.

How do we know that the key is actually in hardware? Here's the fun bit - it doesn't matter. If you're issuing a bearer token to a system then you're already asserting that the system is trusted. If the system is lying to you about whether or not the key it's presenting is hardware-backed then you've already lost. If it lied and the system is later compromised then sure all your apes get stolen, but maybe don't run systems that lie and avoid that situation as a result?

Anyway. This is covered in RFC 8705 so why aren't we all doing this already? From the client side, the largest generic issue is that TPMs are astonishingly slow in comparison to doing a TLS handshake on the CPU. RSA signing operations on TPMs can take around half a second, which doesn't sound too bad, except your browser is probably establishing multiple TLS connections to subdomains on the site it's connecting to and performance is going to tank. Fixing this involves doing whatever's necessary to convince the browser to pipe everything over a single TLS connection, and that's just not really where the web is right at the moment. Using EC keys instead helps a lot (~0.1 seconds per signature on modern TPMs), but it's still going to be a bottleneck.

The other problem, of course, is that ecosystem support for hardware-backed certificates is just awful. Windows lets you stick them into the standard platform certificate store, but the docs for this are hidden in a random PDF in a Github repo. Macs require you to do some weird bridging between the Secure Enclave API and the keychain API. Linux? Well, the standard answer is to do PKCS#11, and I have literally never met anybody who likes PKCS#11 and I have spent a bunch of time in standards meetings with the sort of people you might expect to like PKCS#11 and even they don't like it. It turns out that loading a bunch of random C bullshit that has strong feelings about function pointers into your security critical process is not necessarily something that is going to improve your quality of life, so instead you should use something like this and just have enough C to bridge to a language that isn't secretly plotting to kill your pets the moment you turn your back.

And, uh, obviously none of this matters at all unless people actually support it. Github has no support at all for validating the identity of whoever holds a bearer token. Most issuers of bearer tokens have no support for embedding holder identity into the token. This is not good! As of last week, all three of the big cloud providers support virtualised TPMs in their VMs - we should be running CI on systems that can do that, and tying any issued tokens to the VMs that are supposed to be making use of them.

So sure this isn't trivial. But it's also not impossible, and making this stuff work would improve the security of, well, everything. We literally have the technology to prevent attacks like Github suffered. What do we have to do to get people to actually start working on implementing that?

comment count unavailable comments

17 April 2022

Matthew Garrett: The Freedom Phone is not great at privacy

The Freedom Phone advertises itself as a "Free speech and privacy first focused phone". As documented on the features page, it runs ClearOS, an Android-based OS produced by Clear United (or maybe one of the bewildering array of associated companies, we'll come back to that later). It's advertised as including Signal, but what's shipped is not the version available from the Signal website or any official app store - instead it's this fork called "ClearSignal".

The first thing to note about ClearSignal is that the privacy policy link from that page 404s, which is not a great start. The second thing is that it has a version number of 5.8.14, which is strange because upstream went from 5.8.10 to 5.9.0. The third is that, despite Signal being GPL 3, there's no source code available. So, I grabbed jadx and started looking for differences between ClearSignal and the upstream 5.8.10 release. The results were, uh, surprising.

First up is that they seem to have integrated ACRA, a crash reporting framework. This feels a little odd - in the absence of a privacy policy, it's unclear what information this gathers or how it'll be stored. Having a piece of privacy software automatically uploading information about what you were doing in the event of a crash with no notification other than a toast that appears saying "Crash Report" feels a little dubious.

Next is that Signal (for fairly obvious reasons) warns you if your version is out of date and eventually refuses to work unless you upgrade. ClearSignal has dealt with this problem by, uh, simply removing that code. The MacOS version of the desktop app they provide for download seems to be derived from a release from last September, which for an Electron-based app feels like a pretty terrible idea. Weirdly, for Windows they link to an official binary release from February 2021, and for Linux they tell you how to use the upstream repo properly. I have no idea what's going on here.

They've also added support for network backups of your Signal data. This involves the backups being pushed to an S3 bucket using credentials that are statically available in the app. It's ok, though, each upload has some sort of nominally unique identifier associated with it, so it's not trivial to just download other people's backups. But, uh, where does this identifier come from? It turns out that Clear Center, another of the Clear family of companies, employs a bunch of people to work on a ClearID[1], some sort of decentralised something or other that seems to be based on KERI. There's an overview slide deck here which didn't really answer any of my questions and as far as I can tell this is entirely lacking any sort of peer review, but hey it's only the one thing that stops anyone on the internet being able to grab your Signal backups so how important can it be.

The final thing, though? They've extended Signal's invitation support to encourage users to get others to sign up for Clear United. There's an exposed API endpoint called "get_user_email_by_mobile_number" which does exactly what you'd expect - if you give it a registered phone number, it gives you back the associated email address. This requires no authentication. But it gets better! The API to generate a referral link to send to others sends the name and phone number of everyone in your phone's contact list. There does not appear to be any indication that this is going to happen.

So, from a privacy perspective, going to go with things being some distance from ideal. But what's going on with all these Clear companies anyway? They all seem to be related to Michael Proper, who founded the Clear Foundation in 2009. They are, perhaps unsurprisingly, heavily invested in blockchain stuff, while Clear United also appears to be some sort of multi-level marketing scheme which has a membership agreement that includes the somewhat astonishing claim that:

Specifically, the initial focus of the Association will provide members with supplements and technologies for:

9a. Frequency Evaluation, Scans, Reports;

9b. Remote Frequency Health Tuning through Quantum Entanglement;

9c. General and Customized Frequency Optimizations;


- there's more discussion of this and other weirdness here. Clear Center, meanwhile, has a Chief Physics Officer? I have a lot of questions.

Anyway. We have a company that seems to be combining blockchain and MLM, has some opinions about Quantum Entanglement, bases the security of its platform on a set of novel cryptographic primitives that seem to have had no external review, has implemented an API that just hands out personal information without any authentication and an app that appears more than happy to upload all your contact details without telling you first, has failed to update this app to keep up with upstream security updates, and is violating the upstream license. If this is their idea of "privacy first", I really hate to think what their code looks like when privacy comes further down the list.

[1] Pointed out to me here

comment count unavailable comments

5 April 2022

Matthew Garrett: Bearer tokens are just awful

As I mentioned last time, bearer tokens are not super compatible with a model in which every access is verified to ensure it's coming from a trusted device. Let's talk about that in a bit more detail.

First off, what is a bearer token? In its simplest form, it's simply an opaque blob that you give to a user after an authentication or authorisation challenge, and then they show it to you to prove that they should be allowed access to a resource. In theory you could just hand someone a randomly generated blob, but then you'd need to keep track of which blobs you've issued and when they should be expired and who they correspond to, so frequently this is actually done using JWTs which contain some base64 encoded JSON that describes the user and group membership and so on and then have a signature associated with them so whenever the user presents one you can just validate the signature and then assume that the contents of the JSON are trustworthy.

One thing to note here is that the crypto is purely between whoever issued the token and whoever validates the token - as far as the server is concerned, any client who can just show it the token is just fine as long as the signature is verified. There's no way to verify the client's state, so one of the core ideas of Zero Trust (that we verify that the client is in a trustworthy state on every access) is already violated.

Can we make things not terrible? Sure! We may not be able to validate the client state on every access, but we can validate the client state when we issue the token in the first place. When the user hits a login page, we do state validation according to whatever policy we want to enforce, and if the client violates that policy we refuse to issue a token to it. If the token has a sufficiently short lifetime then an attacker is only going to have a short period of time to use that token before it expires and then (with luck) they won't be able to get a new one because the state validation will fail.

Except! This is fine for cases where we control the issuance flow. What if we have a scenario where a third party authenticates the client (by verifying that they have a valid token issued by their ID provider) and then uses that to issue their own token that's much longer lived? Well, now the client has a long-lived token sitting on it. And if anyone copies that token to another device, they can now pretend to be that client.

This is, sadly, depressingly common. A lot of services will verify the user, and then issue an oauth token that'll expire some time around the heat death of the universe. If a client system is compromised and an attacker just copies that token to another system, they can continue to pretend to be the legitimate user until someone notices (which, depending on whether or not the service in question has any sort of audit logs, and whether you're paying any attention to them, may be once screenshots of your data show up on Twitter).

This is a problem! There's no way to fit a hosted service that behaves this way into a Zero Trust model - the best you can say is that a token was issued to a device that was, around that time, apparently trustworthy, and now it's some time later and you have literally no idea whether the device is still trustworthy or if the token is still even on that device.

But wait, there's more! Even if you're nowhere near doing any sort of Zero Trust stuff, imagine the case of a user having a bunch of tokens from multiple services on their laptop, and then they leave their laptop unlocked in a cafe while they head to the toilet and whoops it's not there any more, better assume that someone has access to all the data on there. How many services has our opportunistic new laptop owner gained access to as a result? How do we revoke all of the tokens that are sitting there on the local disk? Do you even have a policy for dealing with that?

There isn't a simple answer to all of these problems. Replacing bearer tokens with some sort of asymmetric cryptographic challenge to the client would at least let us tie the tokens to a TPM or other secure enclave, and then we wouldn't have to worry about them being copied elsewhere. But that wouldn't help us if the client is compromised and the attacker simply keeps using the compromised client. The entire model of simply proving knowledge of a secret being sufficient to gain access to a resource is inherently incompatible with a desire for fine-grained trust verification on every access, but I don't see anything changing until we have a standard for third party services to be able to perform that trust verification against a customer's policy.

Still, at least this means I can just run weird Android IoT apps through mitmproxy, pull the bearer token out of the request headers and then start poking the remote API with curl. It may all be broken, but it's also got me a bunch of bug bounty credit, so, it;s impossible to say if its bad or not,

(Addendum: this suggestion that we solve the hardware binding problem by simply passing all the network traffic through some sort of local enclave that could see tokens being set and would then sequester them and reinject them into later requests is OBVIOUSLY HORRIFYING and is also probably going to be at least three startup pitches by the end of next week)

comment count unavailable comments

31 March 2022

Matthew Garrett: ZTA doesn't solve all problems, but partial implementations solve fewer

Traditional network access controls work by assuming that something is trustworthy based on some other factor - for example, if a computer is on your office network, it's trustworthy because only trustworthy people should be able to gain physical access to plug something in. If you restrict access to your services to requests coming from trusted networks, then you can assert that it's coming from a trusted device.

Of course, this isn't necessarily true. A machine on your office network may be compromised. An attacker may obtain valid VPN credentials. Someone could leave a hostile device plugged in under a desk in a meeting room. Trust is being placed in devices that may not be trustworthy.

A Zero Trust Architecture (ZTA) is one where a device is granted no inherent trust. Instead, each access to a service is validated against some policy - if the policy is satisfied, the access is permitted. A typical implementation involves granting each device some sort of cryptographic identity (typically a TLS client certificate) and placing the protected services behind a proxy. The proxy verifies the device identity, queries another service to obtain the current device state (we'll come back to that in a moment), compares the state against a policy and either pass the request through to the service or reject it. Different services can have different policies (eg, you probably want a lax policy around whatever's hosting the documentation for how to fix your system if it's being refused access to something for being in the wrong state), and if you want you can also tie it to proof of user identity in some way.

From a user perspective, this is entirely transparent. The proxy is made available on the public internet, DNS for the services points to the proxy, and every time your users try to access the service they hit the proxy instead and (if everything's ok) gain access to it no matter which network they're on. There's no need to connect to a VPN first, and there's no worries about accidentally leaking information over the public internet instead of over a secure link.

It's also notable that traditional solutions tend to be all-or-nothing. If I have some services that are more sensitive than others, the only way I can really enforce this is by having multiple different VPNs and only granting access to sensitive services from specific VPNs. This obviously risks combinatorial explosion once I have more than a couple of policies, and it's a terrible user experience.

Overall, ZTA approaches provide more security and an improved user experience. So why are we still using VPNs? Primarily because this is all extremely difficult. Let's take a look at an extremely recent scenario. A device used by customer support technicians was compromised. The vendor in question has a solution that can tie authentication decisions to whether or not a device has a cryptographic identity. If this was in use, and if the cryptographic identity was tied to the device hardware (eg, by being generated in a TPM), the attacker would not simply be able to obtain the user credentials and log in from their own device. This is good - if the attacker wanted to maintain access to the service, they needed to stay on the device in question. This increases the probability of the monitoring tooling on the compromised device noticing them.

Unfortunately, the attacker simply disabled the monitoring tooling on the compromised device. If device state was being verified on each access then this would be noticed before too long - the last data received from the device would be flagged as too old, and the requests would no longer satisfy any reasonable access control policy. Instead, the device was assumed to be trustworthy simply because it could demonstrate its identity. There's an important point here: just because a device belongs to you doesn't mean it's a trustworthy device.

So, if ZTA approaches are so powerful and user-friendly, why aren't we all using one? There's a few problems, but the single biggest is that there's no standardised way to verify device state in any meaningful way. Remote Attestation can both prove device identity and the device boot state, but the only product on the market that does much with this is Microsoft's Device Health Attestation. DHA doesn't solve the broader problem of also reporting runtime state - it may be able to verify that endpoint monitoring was launched, but it doesn't make assertions about whether it's still running. Right now, people are left trying to scrape this information from whatever tooling they're running. The absence of any standardised approach to this problem means anyone who wants to deploy a strong ZTA has to integrate with whatever tooling they're already running, and that then increases the cost of migrating to any other tooling later.

But even device identity is hard! Knowing whether a machine should be given a certificate or not depends on knowing whether or not you own it, and inventory control is a surprisingly difficult problem in a lot of environments. It's not even just a matter of whether a machine should be given a certificate in the first place - if a machine is reported as lost or stolen, its trust should be revoked. Your inventory system needs to tie into your device state store in order to ensure that your proxies drop access.

And, worse, all of this depends on you being able to put stuff behind a proxy in the first place! If you're using third-party hosted services, that's a problem. In the absence of a proxy, trust decisions are probably made at login time. It's possible to tie user auth decisions to device identity and state (eg, a self-hosted SAML endpoint could do that before passing through to the actual ID provider), but that's still going to end up providing a bearer token of some sort that can potentially be exfiltrated, and will continue to be trusted even if the device state becomes invalid.

ZTA doesn't solve all problems, and there isn't a clear path to it doing so without significantly greater industry support. But a complete ZTA solution is significantly more powerful than a partial one. Verifying device identity is a step on the path to ZTA, but in the absence of device state verification it's only a step.

comment count unavailable comments

23 March 2022

Matthew Garrett: AMD's Pluton implementation seems to be controllable

I've been digging through the firmware for an AMD laptop with a Ryzen 6000 that incorporates Pluton for the past couple of weeks, and I've got some rough conclusions. Note that these are extremely preliminary and may not be accurate, but I'm going to try to encourage others to look into this in more detail. For those of you at home, I'm using an image from here, specifically version 309. The installer is happy to run under Wine, and if you tell it to "Extract" rather than "Install" it'll leave a file sitting in C:\\DRIVERS\ASUS_GA402RK_309_BIOS_Update_20220322235241 which seems to have an additional 2K of header on it. Strip that and you should have something approximating a flash image.

Looking for UTF16 strings in this reveals something interesting:

Pluton (HSP) X86 Firmware Support
Enable/Disable X86 firmware HSP related code path, including AGESA HSP module, SBIOS HSP related drivers.
Auto - Depends on PcdAmdHspCoreEnable build value
NOTE: PSP directory entry 0xB BIT36 have the highest priority.
NOTE: This option will NOT put HSP hardware in disable state, to disable HSP hardware, you need setup PSP directory entry 0xB, BIT36 to 1.
// EntryValue[36] = 0: Enable, HSP core is enabled.
// EntryValue[36] = 1: Disable, HSP core is disabled then PSP will gate the HSP clock, no further PSP to HSP commands. System will boot without HSP.

"HSP" here means "Hardware Security Processor" - a generic term that refers to Pluton in this case. This is a configuration setting that determines whether Pluton is "enabled" or not - my interpretation of this is that it doesn't directly influence Pluton, but disables all mechanisms that would allow the OS to communicate with it. In this scenario, Pluton has its firmware loaded and could conceivably be functional if the OS knew how to speak to it directly, but the firmware will never speak to it itself. I took a quick look at the Windows drivers for Pluton and it looks like they won't do anything unless the firmware wants to expose Pluton, so this should mean that Windows will do nothing.

So what about the reference to "PSP directory entry 0xB BIT36 have the highest priority"? The PSP is the AMD Platform Security Processor - it's an ARM core on the CPU package that boots before the x86. The PSP firmware lives in the same flash image as the x86 firmware, so the PSP looks for a header that points it towards the firmware it should execute. This gives a pointer to a "directory" - a list of different object types and where they're located in flash (there's a description of this for slightly older AMDs here). Type 0xb is treated slightly specially. Where most types contain the address of where the actual object is, type 0xb contains a 64-bit value that's interpreted as enabling or disabling various features - something AMD calls "soft fusing" (Intel have something similar that involves setting bits in the Firmware Interface Table). The PSP looks at the bits that are set here and alters its behaviour. If bit 36 is set, the PSP tells Pluton to turn itself off and will no longer send any commands to it.

So, we have two mechanisms to disable Pluton - the PSP can tell it to turn itself off, or the x86 firmware can simply never speak to it or admit that it exists. Both of these imply that Pluton has started executing before it's shut down, so it's reasonable to wonder whether it can still do stuff. In the image I'm looking at, there's a blob starting at 0x0069b610 that appears to be firmware for Pluton - it contains chunks that appear to be the reference TPM2 implementation, and it broadly decompiles as valid ARM code. It should be viable to figure out whether it can do anything in the face of being "disabled" via either of the above mechanisms.

Unfortunately for me, the system I'm looking at does set bit 36 in the 0xb entry - as a result, Pluton is disabled before x86 code starts running and I can't investigate further in any straightforward way. The implication that the user-controllable mechanism for disabling Pluton merely disables x86 communication with it rather than turning it off entirely is a little concerning, although (assuming Pluton is behaving as a TPM rather than having an enhanced set of capabilities) skipping any firmware communication means the OS has no way to know what happened before it started running even if it has a mechanism to communicate with Pluton without firmware assistance. In that scenario it'd be viable to write a bootloader shim that just faked up the firmware measurements before handing control to the OS.

The bit 36 disabling mechanism seems more solid? Again, it should be possible to analyse the Pluton firmware to determine whether it actually pays attention to a disable command being sent. But even if it chooses to ignore that, if the PSP is in a position to just cut the clock to Pluton, it's not going to be able to do a lot. At that point we're trusting AMD rather than trusting Microsoft, but given that you're also trusting AMD to execute the code you're giving them to execute, it's hard to avoid placing trust in them.

Overall: I'm reasonably confident that systems that ship with Pluton disabled via setting bit 36 in the soft fuses are going to disable it sufficiently hard that the OS can't do anything about it. Systems that give the user an option to enable or disable it are a little less clear in that respect, and it's possible (but not yet demonstrated) that an OS could communicate with Pluton anyway. However, if that's true, and if the firmware never communicates with Pluton itself, the user could install a stub loader in UEFI that mimicks the firmware behaviour and leaves the OS thinking everything was good when it absolutely is not.

So, assuming that Pluton in its current form on AMD has no capabilities outside those we know about, the disabling mechanisms are probably good enough. It's tough to make a firm statement on this before I have access to a system that doesn't just disable it immediately, so stay tuned for updates.

comment count unavailable comments

17 January 2022

Matthew Garrett: Boot Guard and PSB have user-hostile defaults

Compromising an OS without it being detectable is hard. Modern operating systems support the imposition of a security policy or the launch of some sort of monitoring agent sufficient early in boot that even if you compromise the OS, you're probably going to have left some sort of detectable trace[1]. You can avoid this by attacking the lower layers - if you compromise the bootloader then it can just hotpatch a backdoor into the kernel before executing it, for instance.

This is avoided via one of two mechanisms. Measured boot (such as TPM-based Trusted Boot) makes a tamper-proof cryptographic record of what the system booted, with each component in turn creating a measurement of the next component in the boot chain. If a component is tampered with, its measurement will be different. This can be used to either prevent the release of a cryptographic secret if the boot chain is modified (for instance, using the TPM to encrypt the disk encryption key), or can be used to attest the boot state to another device which can tell you whether you're safe or not. The other approach is verified boot (such as UEFI Secure Boot), where each component in the boot chain verifies the next component before executing it. If the verification fails, execution halts.

In both cases, each component in the boot chain measures and/or verifies the next. But something needs to be the first link in this chain, and traditionally this was the system firmware. Which means you could tamper with the system firmware and subvert the entire process - either have the firmware patch the bootloader in RAM after measuring or verifying it, or just load a modified bootloader and lie about the measurements or ignore the verification. Attackers had already been targeting the firmware (Hacking Team had something along these lines, although this was pre-secure boot so just dropped a rootkit into the OS), and given a well-implemented measured and verified boot chain, the firmware becomes an even more attractive target.

Intel's Boot Guard and AMD's Platform Secure Boot attempt to solve this problem by moving the validation of the core system firmware to an (approximately) immutable environment. Intel's solution involves the Management Engine, a separate x86 core integrated into the motherboard chipset. The ME's boot ROM verifies a signature on its firmware before executing it, and once the ME is up it verifies that the system firmware's bootblock is signed using a public key that corresponds to a hash blown into one-time programmable fuses in the chipset. What happens next depends on policy - it can either prevent the system from booting, allow the system to boot to recover the firmware but automatically shut it down after a while, or flag the failure but allow the system to boot anyway. Most policies will also involve a measurement of the bootblock being pushed into the TPM.

AMD's Platform Secure Boot is slightly different. Rather than the root of trust living in the motherboard chipset, it's in AMD's Platform Security Processor which is incorporated directly onto the CPU die. Similar to Boot Guard, the PSP has ROM that verifies the PSP's own firmware, and then that firmware verifies the system firmware signature against a set of blown fuses in the CPU. If that fails, system boot is halted. I'm having trouble finding decent technical documentation about PSB, and what I have found doesn't mention measuring anything into the TPM - if this is the case, PSB only implements verified boot, not measured boot.

What's the practical upshot of this? The first is that you can't replace the system firmware with anything that doesn't have a valid signature, which effectively means you're locked into firmware the vendor chooses to sign. This prevents replacing the system firmware with either a replacement implementation (such as Coreboot) or a modified version of the original implementation (such as firmware that disables locking of CPU functionality or removes hardware allowlists). In this respect, enforcing system firmware verification works against the user rather than benefiting them.
Of course, it also prevents an attacker from doing the same thing, but while this is a real threat to some users, I think it's hard to say that it's a realistic threat for most users.

The problem is that vendors are shipping with Boot Guard and (increasingly) PSB enabled by default. In the AMD case this causes another problem - because the fuses are in the CPU itself, a CPU that's had PSB enabled is no longer compatible with any motherboards running firmware that wasn't signed with the same key. If a user wants to upgrade their system's CPU, they're effectively unable to sell the old one. But in both scenarios, the user's ability to control what their system is running is reduced.

As I said, the threat that these technologies seek to protect against is real. If you're a large company that handles a lot of sensitive data, you should probably worry about it. If you're a journalist or an activist dealing with governments that have a track record of targeting people like you, it should probably be part of your threat model. But otherwise, the probability of you being hit by a purely userland attack is so ludicrously high compared to you being targeted this way that it's just not a big deal.

I think there's a more reasonable tradeoff than where we've ended up. Tying things like disk encryption secrets to TPM state means that if the system firmware is measured into the TPM prior to being executed, we can at least detect that the firmware has been tampered with. In this case nothing prevents the firmware being modified, there's just a record in your TPM that it's no longer the same as it was when you encrypted the secret. So, here's what I'd suggest:

1) The default behaviour of technologies like Boot Guard or PSB should be to measure the firmware signing key and whether the firmware has a valid signature into PCR 7 (the TPM register that is also used to record which UEFI Secure Boot signing key is used to verify the bootloader).
2) If the PCR 7 value changes, the disk encryption key release will be blocked, and the user will be redirected to a key recovery process. This should include remote attestation, allowing the user to be informed that their firmware signing situation has changed.
3) Tooling should be provided to switch the policy from merely measuring to verifying, and users at meaningful risk of firmware-based attacks should be encouraged to make use of this tooling

This would allow users to replace their system firmware at will, at the cost of having to re-seal their disk encryption keys against the new TPM measurements. It would provide enough information that, in the (unlikely for most users) scenario that their firmware has actually been modified without their knowledge, they can identify that. And it would allow users who are at high risk to switch to a higher security state, and for hardware that is explicitly intended to be resilient against attacks to have different defaults.

This is frustratingly close to possible with Boot Guard, but I don't think it's quite there. Before you've blown the Boot Guard fuses, the Boot Guard policy can be read out of flash. This means that you can drop a Boot Guard configuration into flash telling the ME to measure the firmware but not prevent it from running. But there are two problems remaining:

1) The measurement is made into PCR 0, and PCR 0 changes every time your firmware is updated. That makes it a bad default for sealing encryption keys.
2) It doesn't look like the policy is measured before being enforced. This means that an attacker can simply reflash modified firmware with a policy that disables measurement and then make a fake measurement that makes it look like the firmware is ok.

Fixing this seems simple enough - the Boot Guard policy should always be measured, and measurements of the policy and the signing key should be made into a PCR other than PCR 0. If an attacker modified the policy, the PCR value would change. If an attacker modified the firmware without modifying the policy, the PCR value would also change. People who are at high risk would run an app that would blow the Boot Guard policy into fuses rather than just relying on the copy in flash, and enable verification as well as measurement. Now if an attacker tampers with the firmware, the system simply refuses to boot and the attacker doesn't get anything.

Things are harder on the AMD side. I can't find any indication that PSB supports measuring the firmware at all, which obviously makes this approach impossible. I'm somewhat surprised by that, and so wouldn't be surprised if it does do a measurement somewhere. If it doesn't, there's a rather more significant problem - if a system has a socketed CPU, and someone has sufficient physical access to replace the firmware, they can just swap out the CPU as well with one that doesn't have PSB enabled. Under normal circumstances the system firmware can detect this and prompt the user, but given that the attacker has just replaced the firmware we can assume that they'd do so with firmware that doesn't decide to tell the user what just happened. In the absence of better documentation, it's extremely hard to say that PSB actually provides meaningful security benefits.

So, overall: I think Boot Guard protects against a real-world attack that matters to a small but important set of targets. I think most of its benefits could be provided in a way that still gave users control over their system firmware, while also permitting high-risk targets to opt-in to stronger guarantees. Based on what's publicly documented about PSB, it's hard to say that it provides real-world security benefits for anyone at present. In both cases, what's actually shipping reduces the control people have over their systems, and should be considered user-hostile.

[1] Assuming that someone's both turning this on and actually looking at the data produced

comment count unavailable comments

9 January 2022

Matthew Garrett: Pluton is not (currently) a threat to software freedom

At CES this week, Lenovo announced that their new Z-series laptops would ship with AMD processors that incorporate Microsoft's Pluton security chip. There's a fair degree of cynicism around whether Microsoft have the interests of the industry as a whole at heart or not, so unsurprisingly people have voiced concerns about Pluton allowing for platform lock-in and future devices no longer booting non-Windows operating systems. Based on what we currently know, I think those concerns are understandable but misplaced.

But first it's helpful to know what Pluton actually is, and that's hard because Microsoft haven't actually provided much in the way of technical detail. The best I've found is a discussion of Pluton in the context of Azure Sphere, Microsoft's IoT security platform. This, in association with the block diagrams on page 12 and 13 of this slidedeck, suggest that Pluton is a general purpose security processor in a similar vein to Google's Titan chip. It has a relatively low powered CPU core, an RNG, and various hardware cryptography engines - there's nothing terribly surprising here, and it's pretty much the same set of components that you'd find in a standard Trusted Platform Module of the sort shipped in pretty much every modern x86 PC. But unlike Titan, Pluton seems to have been designed with the explicit goal of being incorporated into other chips, rather than being a standalone component. In the Azure Sphere case, we see it directly incorporated into a Mediatek chip. In the Xbox Series devices, it's incorporated into the SoC. And now, we're seeing it arrive on general purpose AMD CPUs.

Microsoft's announcement says that Pluton can be shipped in three configurations:as the Trusted Platform Module; as a security processor used for non-TPM scenarios like platform resiliency; or OEMs can choose to ship with Pluton turned off. What we're likely to see to begin with is the former - Pluton will run firmware that exposes a Trusted Computing Group compatible TPM interface. This is almost identical to the status quo. Microsoft have required that all Windows certified hardware ship with a TPM for years now, but for cost reasons this is often not in the form of a separate hardware component. Instead, both Intel and AMD provide support for running the TPM stack on a component separate from the main execution cores on the system - for Intel, this TPM code runs on the Management Engine integrated into the chipset, and for AMD on the Platform Security Processor that's integrated into the CPU package itself.

So in this respect, Pluton changes very little; the only difference is that the TPM code is running on hardware dedicated to that purpose, rather than alongside other code. Importantly, in this mode Pluton will not do anything unless the system firmware or OS ask it to. Pluton cannot independently block the execution of any other code - it knows nothing about the code the CPU is executing unless explicitly told about it. What the OS can certainly do is ask Pluton to verify a signature before executing code, but the OS could also just verify that signature itself. Windows can already be configured to reject software that doesn't have a valid signature. If Microsoft wanted to enforce that they could just change the default today, there's no need to wait until everyone has hardware with Pluton built-in.

The two things that seem to cause people concerns are remote attestation and the fact that Microsoft will be able to ship firmware updates to Pluton via Windows Update. I've written about remote attestation before, so won't go into too many details here, but the short summary is that it's a mechanism that allows your system to prove to a remote site that it booted a specific set of code. What's important to note here is that the TPM (Pluton, in the scenario we're talking about) can't do this on its own - remote attestation can only be triggered with the aid of the operating system. Microsoft's Device Health Attestation is an example of remote attestation in action, and the technology definitely allows remote sites to refuse to grant you access unless you booted a specific set of software. But there are two important things to note here: first, remote attestation cannot prevent you from booting whatever software you want, and second, as evidenced by Microsoft already having a remote attestation product, you don't need Pluton to do this! Remote attestation has been possible since TPMs started shipping over two decades ago.

The other concern is Microsoft having control over the firmware updates. The context here is that TPMs are not magically free of bugs, and sometimes these can have security consequences. One example is Infineon TPMs producing weak RSA keys, a vulnerability that could be rectified by a firmware update to the TPM. Unfortunately these updates had to be issued by the device manufacturer rather than Infineon being able to do so directly. This meant users had to wait for their vendor to get around to shipping an update, something that might not happen at all if the machine was sufficiently old. From a security perspective, being able to ship firmware updates for the TPM without them having to go through the device manufacturer is a huge win.

Microsoft's obviously in a position to ship a firmware update that modifies the TPM's behaviour - there would be no technical barrier to them shipping code that resulted in the TPM just handing out your disk encryption secret on demand. But Microsoft already control the operating system, so they already have your disk encryption secret. There's no need for them to backdoor the TPM to give them something that the TPM's happy to give them anyway. If you don't trust Microsoft then you probably shouldn't be running Windows, and if you're not running Windows Microsoft can't update the firmware on your TPM.

So, as of now, Pluton running firmware that makes it look like a TPM just isn't a terribly interesting change to where we are already. It can't block you running software (either apps or operating systems). It doesn't enable any new privacy concerns. There's no mechanism for Microsoft to forcibly push updates to it if you're not running Windows.

Could this change in future? Potentially. Microsoft mention another use-case for Pluton "as a security processor used for non-TPM scenarios like platform resiliency", but don't go into any more detail. At this point, we don't know the full set of capabilities that Pluton has. Can it DMA? Could it play a role in firmware authentication? There are scenarios where, in theory, a component such as Pluton could be used in ways that would make it more difficult to run arbitrary code. It would be reassuring to hear more about what the non-TPM scenarios are expected to look like and what capabilities Pluton actually has.

But let's not lose sight of something more fundamental here. If Microsoft wanted to block free operating systems from new hardware, they could simply mandate that vendors remove the ability to disable secure boot or modify the key databases. If Microsoft wanted to prevent users from being able to run arbitrary applications, they could just ship an update to Windows that enforced signing requirements. If they want to be hostile to free software, they don't need Pluton to do it.

(Edit: it's been pointed out that I kind of gloss over the fact that remote attestation is a potential threat to free software, as it theoretically allows sites to block access based on which OS you're running. There's various reasons I don't think this is realistic - one is that there's just way too much variability in measurements for it to be practical to write a policy that's strict enough to offer useful guarantees without also blocking a number of legitimate users, and the other is that you can just pass the request through to a machine that is running the appropriate software and have it attest for you. The fact that nobody has actually bothered to use remote attestation for this purpose even though most consumer systems already ship with TPMs suggests that people generally agree with me on that)

comment count unavailable comments

31 December 2021

Matthew Garrett: Update on Linux hibernation support when lockdown is enabled

Some time back I wrote up a description of my proposed (and implemented) solution for making hibernation work under Linux even within the bounds of the integrity model. It's been a while, so here's an update.

The first is that localities just aren't an option. It turns out that they're optional in the spec, and TPMs are entirely permitted to say they don't support them. The only time they're likely to work is on platforms that support DRTM implementations like TXT. Most consumer hardware doesn't fall into that category, so we don't get to use that solution. Unfortunate, but, well.

The second is that I'd ignored an attack vector. If the kernel is configured to restrict access to PCR 23, then yes, an attacker is never able to modify PCR 23 to be in the same state it would be if hibernation were occurring and the key certification data will fail to validate. Unfortunately, an attacker could simply boot into an older kernel that didn't implement the PCR 23 restriction, and could fake things up there (yes, this is getting a bit convoluted, but the entire point here is to make this impossible rather than just awkward). Once PCR 23 was in the correct state, they would then be able to write out a new swap image, boot into a new kernel that supported the secure hibernation solution, and have that resume successfully in the (incorrect) belief that the image was written out in a secure environment.

This felt like an awkward problem to fix. We need to be able to distinguish between the kernel having modified the PCRs and userland having modified the PCRs, and we need to be able to do this without modifying any kernels that have already been released[1]. The normal approach to determining whether an event occurred in a specific phase of the boot process is to "cap" the PCR - extend it with a known value that indicates a transition between stages of the boot process. Any events that occur before the cap event must have occurred in the previous stage of boot, and since the final PCR value depends on the order of measurements and not just the contents of those measurements, if a PCR is capped before userland runs, userland can't fake the same PCR value afterwards. If Linux capped a PCR before userland started running, we'd be able to place a measurement there before the cap occurred and then prove that that extension occurred before userland had the opportunity to interfere. We could simply place a statement that the kernel supported the PCR 23 restrictions there, and we'd be fine.

Unfortunately Linux doesn't currently do this, and adding support for doing so doesn't fix the problem - if an attacker boots a kernel that doesn't cap a PCR, they can just cap it themselves from userland. So, we're faced with the same problem: booting an older kernel allows the system to be placed in an identical state to the current kernel, and a fake hibernation image can be written out. Solving this required a PCR that was being modified after kernel code was running, but before userland was started, even with existing kernels.

Thankfully, there is one! PCR 5 is defined as containing measurements related to boot management configuration and data. One of the measurements it contains is the result of the UEFI ExitBootServices() call. ExitBootServices() is called at the transition from the UEFI boot environment to the running OS, and the kernel contains code that executes before it. So, if we measure an assertion regarding whether or not we support restricted access to PCR 23 into PCR 5 before we call ExitBootServices(), this will prevent userspace from spoofing us (because userspace will only be able to extend PCR 5 after the firmware extended PCR 5 in response to ExitBootServices() being called). Obviously this depends on the firmware actually performing the PCR 5 extension when ExitBootServices() is called, but if firmware's out of spec then I don't think there's any real expectation of it being secure enough for any of this to buy you anything anyway.

My current tree is here, but there's a couple of things I want to do before submitting it, including ensuring that the key material is wiped from RAM after use (otherwise it could potentially be scraped out and used to generate another image afterwards) and, uh, actually making sure this works (I no longer have the machine I was previously using for testing, and switching my other dev machine over to TPM 2 firmware is proving troublesome, so I need to pull another machine out of the stack and reimage it).

[1] The linear nature of time makes feature development much more frustrating

comment count unavailable comments

21 September 2021

Russell Coker: Links September 2021

Matthew Garrett wrote an interesting and insightful blog post about the license of software developed or co-developed by machine-learning systems [1]. One of his main points is that people in the FOSS community should aim for less copyright protection. The USENIX ATC 21/OSDI 21 Joint Keynote Address titled It s Time for Operating Systems to Rediscover Hardware has some inssightful points to make [2]. Timothy Roscoe makes some incendiaty points but backs them up with evidence. Is Linux really an OS? I recommend that everyone who s interested in OS design watch this lecture. Cory Doctorow wrote an interesting set of 6 articles about Disneyland, ride pricing, and crowd control [3]. He proposes some interesting ideas for reforming Disneyland. Benjamin Bratton wrote an insightful article about how philosophy failed in the pandemic [4]. He focuses on the Italian philosopher Giorgio Agamben who has a history of writing stupid articles that match Qanon talking points but with better language skills. Arstechnica has an interesting article about penetration testers extracting an encryption key from the bus used by the TPM on a laptop [5]. It s not a likely attack in the real world as most networks can be broken more easily by other methods. But it s still interesting to learn about how the technology works. The Portalist has an article about David Brin s Startide Rising series of novels and his thought s on the concept of Uplift (which he denies inventing) [6]. Jacobin has an insightful article titled You re Not Lazy But Your Boss Wants You to Think You Are [7]. Making people identify as lazy is bad for them and bad for getting them to do work. But this is the first time I ve seen it described as a facet of abusive capitalism. Jacobin has an insightful article about free public transport [8]. Apparently there are already many regions that have free public transport (Tallinn the Capital of Estonia being one example). Fare free public transport allows bus drivers to concentrate on driving not taking fares, removes the need for ticket inspectors, and generally provides a better service. It allows passengers to board buses and trams faster thus reducing traffic congestion and encourages more people to use public transport instead of driving and reduces road maintenance costs. Interesting research from Israel about bypassing facial ID [9]. Apparently they can make a set of 9 images that can pass for over 40% of the population. I didn t expect facial recognition to be an effective form of authentication, but I didn t expect it to be that bad. Edward Snowden wrote an insightful blog post about types of conspiracies [10]. Kevin Rudd wrote an informative article about Sky News in Australia [11]. We need to have a Royal Commission now before we have our own 6th Jan event. Steve from Big Mess O Wires wrote an informative blog post about USB-C and 4K 60Hz video [12]. Basically you can t have a single USB-C hub do 4K 60Hz video and be a USB 3.x hub unless you have compression software running on your PC (slow and only works on Windows), or have DisplayPort 1.4 or Thunderbolt (both not well supported). All of the options are not well documented on online store pages so lots of people will get unpleasant surprises when their deliveries arrive. Computers suck. Steinar H. Gunderson wrote an informative blog post about GaN technology for smaller power supplies [13]. A 65W USB-C PSU that fits the usual wall wart form factor is an interesting development.

13 July 2021

Matthew Garrett: Does free software benefit from ML models being derived works of training data?

Github recently announced Copilot, a machine learning system that makes suggestions for you when you're writing code. It's apparently trained on all public code hosted on Github, which means there's a lot of free software in its training set. Github assert that the output of Copilot belongs to the user, although they admit that it may occasionally produce output that is identical to content from the training set.

Unsurprisingly, this has led to a number of questions along the lines of "If Copilot embeds code that is identical to GPLed training data, is my code now GPLed?". This is extremely understandable, but the underlying issue is actually more general than that. Even code under permissive licenses like BSD requires retention of copyright notices and disclaimers, and failing to include them is just as much a copyright violation as incorporating GPLed code into a work and not abiding by the terms of the GPL is.

But free software licenses only have power to the extent that copyright permits them to. If your code isn't a derived work of GPLed material, you have no obligation to follow the terms of the GPL. Github clearly believe that Copilot's output doesn't count as a derived work as far as US copyright law goes, and as a result the licenses on the training data don't apply to the output. Some people have interpreted this as an attack on free software - Copilot may insert code that's either identical or extremely similar to GPLed code, and claim that there are no license obligations created as a result, effectively allowing the laundering of GPLed code into proprietary software.

I'm completely unqualified to hold a strong opinion on whether Github's legal position is justifiable or not, and right now I'm also not interested in thinking about it too much. What I think is more interesting is what the impact of either position has on free software. Do we benefit more from a future where the output of Copilot (or similar projects) is considered a derived work of the training data, or one where it isn't? Having been involved in a bunch of GPL enforcement activities, it's very easy to think of this as something that weakens the GPL and, as a result, weakens free software. That was my initial reaction, but that's shifted over the past few days.

Let's look at the GNU manifesto, specifically this section:

The fact that the easiest way to copy a program is from one neighbor to another, the fact that a program has both source code and object code which are distinct, and the fact that a program is used rather than read and enjoyed, combine to create a situation in which a person who enforces a copyright is harming society as a whole both materially and spiritually; in which a person should not do so regardless of whether the law enables him to.

The GPL makes use of copyright law to ensure that GPLed work can't be taken from the commons. Anyone who produces a derived work of GPLed code is obliged to provide that work under the same terms. If software weren't copyrightable, the GPL would have no power. But this is the outcome Stallman wanted! The GPL doesn't exist because copyright is good, it exists because software being copyrightable is what enables the concept of proprietary software in the first place.

The powers that the GPL uses to enforce sharing of code are used by the authors of proprietary software to reduce that sharing. They attempt to forbid us from examining their code to determine how it works - they argue that anyone who does so is tainted, unable to contribute similar code to free software projects in case they produce a derived work of the original. Broadly speaking, the further the definition of a derived work reaches, the greater the power of proprietary software authors. If Oracle's argument that APIs are copyrightable had prevailed, it would have been disastrous for free software. If the Apple look and feel suit had established that Microsoft infringed Apple's copyright, we might be living in a future where we had no free software desktop environments.

When we argue for an interpretation of copyright law that enhances the power of the GPL, we're also enhancing the power of giant corporations with a lot of lawyers on hand. So let's look at this another way. If Github's interpretation of copyright law holds, we can train a model on proprietary code and extract concepts without having to worry about being tainted. The proprietary code itself won't enter the commons, but the ideas it embodies will. No more worries about whether you're literally copying the code that implements an algorithm you want to duplicate - simply start typing and let the model remove the risk for you.

There's a reasonable counter argument about equality here. How much GPL-influenced code is going to end up in proprietary projects when compared to the reverse? It's not an easy question to answer, but we should bear in mind that the majority of public repositories on Github aren't under an open source license. Copilot is already claiming to give us access to the concepts embodied in those repositories. Do these provide more value than is given up? I honestly don't know how to measure that. But what I do know is that free software was founded in a belief that software shouldn't be constrained by copyright, and our default stance shouldn't be to argue against the idea that copyright is weaker than we imagined.

(Edit: this post by Julia Reda makes some of the same arguments, but spends some more time focusing on a legal analysis of why having copyright cover the output of Copilot would be a problem)

comment count unavailable comments

4 June 2021

Matthew Garrett: Mike Lindell's Cyber "Evidence"

Mike Lindell, notable for absolutely nothing relevant in this field, today filed a lawsuit against a couple of voting machine manufacturers in response to them suing him for defamation after he claimed that they were covering up hacks that had altered the course of the US election. Paragraph 104 of his suit asserts that he has evidence of at least 20 documented hacks, including the number of votes that were changed. The citation is just a link to a video called Absolute 9-0, which claims to present sufficient evidence that the US supreme court will come to a 9-0 decision that the election was tampered with.

The claim is that Lindell was provided with a set of files on the 9th of January, and gave these to some cyber experts to verify. These experts identified them as packet captures. The video contains scrolling hex, and we are told that this is the raw encrypted data from the files. In reality, the hex values correspond very clearly to printable ASCII, and appear to just be the Pennsylvania voter roll. They're not encrypted, and they're not packet captures (they contain no packet headers).

20 of these packet captures were then selected and analysed, giving us the tables contained within Exhibit 12. The alleged source IPs appear to correspond to the networks the tables claim, and the latitude and longitude presumably just come from a geoip lookup of some sort (although clearly those values are far too precise to be accurate). But if we look at the target IPs, we find something interesting. Most of them resolve to the website for the county that was the nominal target (eg, 198.108.253.104 is www.deltacountymi.org). So, we're supposed to believe that in many cases, the county voting infrastructure was hosted on the county website.

Unfortunately we're not given the destination port, but 198.108.253.104 isn't listening on anything other than 80 and 443. We're told that the packet data is encrypted, so presumably it's over HTTPS. So, uh, how did they decrypt this to figure out how many votes were switched? If Mike's hackers have broken TLS, they really don't need to be dealing with this.

We're also given some background information on how it's impossible to reconstruct packet captures after the fact (untrue), or that modifying them would change their hashes (true, but in the absence of known good hash values that tells us nothing), but it's pretty clear that nothing we're shown actually demonstrates what we're told it does.

In summary: yes, any supreme court decision on this would be 9-0, just not the way he's hoping for.

Update: It was pointed out that this data appears to be part of a larger dataset. This one is even more dubious - it somehow has MAC addresses for both the source and destination (which is impossible), and almost none of these addresses are in actual issued ranges.

comment count unavailable comments

2 June 2021

Matthew Garrett: Producing a trustworthy x86-based Linux appliance

Let's say you're building some form of appliance on top of general purpose x86 hardware. You want to be able to verify the software it's running hasn't been tampered with. What's the best approach with existing technology?

Let's split this into two separate problems. The first is to do as much as we can to ensure that the software can't be modified without our consent[1]. This requires that each component in the boot chain verify that the next component is legitimate. We call the first component in this chain the root of trust, and in the x86 world this is the system firmware[2]. This firmware is responsible for verifying the bootloader, and the easiest way to do this on x86 is to use UEFI Secure Boot. In this setup the firmware contains a set of trusted signing certificates and will only boot executables with a chain of trust to one of these certificates. Switching the system into setup mode from the firmware menu will allow you to remove the existing keys and install new ones.

(Note: You shouldn't use the trusted certificate directly for signing bootloaders - instead, the trusted certificate should be used to sign another certificate and the key for that certificate used to sign your bootloader. This way, if you ever need to revoke the signing certificate, you can simply sign a new one with the trusted parent and push out a revocation update instead of having to provision new keys)

But what do you want to sign? In the general purpose Linux world, we use an intermediate bootloader called Shim to bridge from the Microsoft signing authority to a distribution one. Shim then verifies the signature on grub, and grub in turn verifies the signature on the kernel. This is a large body of code that exists because of the use cases that general purpose distributions need to support - primarily, booting on arbitrary off the shelf hardware, and allowing arbitrary and complicated boot setups. This is unnecessary in the appliance case, where the hardware target can be well defined, where there's no need for interoperability with the Microsoft signing authority, and where the boot configuration can be extremely static.

We can skip all of this complexity using systemd-boot's unified Linux image support. This has the format described here, but the short version is that it's simply a kernel and initramfs linked into a small EFI executable that will run them. Instructions for generating such an image are here, and if you follow them you'll end up with a single static image that can be directly executed by the firmware. Signing this avoids dealing with a whole host of problems associated with relying on shim and grub, but note that you'll be embedding the initramfs as well. Again, this should be fine for appliance use-cases, but you'll need your build system to support building the initramfs at image creation time rather than relying on it being generated on the host.

At this point we have a single image that can be verified by the firmware and will get us to the point of a running kernel and initramfs. Unless you've got enough RAM that you can put your entire workload in the initramfs, you're going to want a filesystem as well, and you're going to want to verify that that filesystem hasn't been tampered with. The easiest approach to this is to use dm-verity, a device-mapper layer that uses a hash tree to verify that the filesystem contents haven't been modified. The kernel needs to know what the root hash is, so this can either be embedded into your initramfs image or into the kernel command line. Either way, it'll end up in the signed boot image, so nobody will be able to tamper with it.

It's important to note that a dm-verity partition is read-only - the kernel doesn't have the cryptographic secret that would be required to generate a new hash tree if the partition is modified. So if you require the ability to write data or logs anywhere, you'll need to add a new partition for that. If this partition is unencrypted, an attacker with access to the device will be able to put whatever they want on there. You should treat any data you read from there as untrusted, and ensure that it's validated before use (ie, don't just feed it to a random parser written in C and expect that everything's going to be ok). On the other hand, if it's encrypted, remember that you can't just put the encryption key in the boot image - an attacker with access to the device is going to be able to dump that and extract it. You'll probably want to use a TPM-sealed encryption secret, which will be discussed later on.

At this point everything in the boot process is cryptographically verified, and so should be difficult to tamper with. Unfortunately this isn't really sufficient - on x86 systems there's typically no verification of the integrity of the secure boot database. An attacker with physical access to the system could attach a programmer directly to the firmware flash and rewrite the secure boot database to include keys they control. They could then replace the boot image with one that they've signed, and the machine would happily boot code that the attacker controlled. We need to be able to demonstrate that the system booted using the correct secure boot keys, and the only way we can do that is to use the TPM.

I wrote an introduction to TPMs a while back. The important thing to know here is that the TPM contains a set of Platform Configuration Registers that are large enough to contain a cryptographic hash. During boot, each component of the boot process will generate a "measurement" of other security critical components, including the next component to be booted. These measurements are a representation of the data in question - they may simply be a hash of the object being measured, or the hash of a structure containing various pieces of metadata. Each measurement is passed to the TPM, along with the PCR it should be measured into. The TPM takes the new measurement, appends it to the existing value, and then stores the hash of this concatenated data in the PCR. This means that the final PCR value depends not only on the measurement, but also on every previous measurement. Without breaking the hash algorithm, there's no way to set the PCR to an arbitrary value. The hash values and some associated data are stored in a log that's kept in system RAM, which we'll come back to later.

Different PCRs store different pieces of information, but the one that's most interesting to us is PCR 7. Its use is documented in the TCG PC Client Platform Firmware Profile (section 3.3.4.8), but the short version is that the firmware will measure the secure boot keys that are used to boot the system. If the secure boot keys are altered (such as by an attacker flashing new ones), the PCR 7 value will change.

What can we do with this? There's a couple of choices. For devices that are online, we can perform remote attestation, a process where the device can provide a signed copy of the PCR values to another system. If the system also provides a copy of the TPM event log, the individual events in the log can be replayed in the same way that the TPM would use to calculate the PCR values, and then compared to the actual PCR values. If they match, that implies that the log values are correct, and we can then analyse individual log entries to make assumptions about system state. If a device has been tampered with, the PCR 7 values and associated log entries won't match the expected values, and we can detect the tampering.

If a device is offline, or if there's a need to permit local verification of the device state, we still have options. First, we can perform remote attestation to a local device. I demonstrated doing this over Bluetooth at LCA back in 2020. Alternatively, we can take advantage of other TPM features. TPMs can be configured to store secrets or keys in a way that renders them inaccessible unless a chosen set of PCRs have specific values. This is used in tpm2-totp, which uses a secret stored in the TPM to generate a TOTP value. If the same secret is enrolled in any standard TOTP app, the value generated by the machine can be compared to the value in the app. If they match, the PCR values the secret was sealed to are unmodified. If they don't, or if no numbers are generated at all, that demonstrates that PCR 7 is no longer the same value, and that the system has been tampered with.

Unfortunately, TOTP requires that both sides have possession of the same secret. This is fine when a user is making that association themselves, but works less well if you need some way to ship the secret on a machine and then separately ship the secret to a user. If the user can simply download the secret via some API, so can an attacker. If an attacker has the secret, they can modify the secure boot database and re-seal the secret to the new PCR 7 value. That means having to add some form of authentication, along with a strong binding of machine serial number to a user (in order to avoid someone with valid credentials simply downloading all the secrets).

Instead, we probably want some mechanism that uses asymmetric cryptography. A keypair can be generated on the TPM, which will refuse to release an unencrypted copy of the private key. The public key, however, can be exported and stored. If it's acceptable for a verification app to connect to the internet then the public key can simply be obtained that way - if not, a certificate can be issued to the key, and this exposed to the verifier via a QR code. The app then verifies that the certificate is signed by the vendor, and if so extracts the public key from that. The private key can have an associated policy that only permits its use when PCR 7 has an appropriate value, so the app then generates a nonce and asks the user to type that into the device. The device generates a signature over that nonce and displays that as a QR code. The app verifies the signature matches, and can then assert that PCR 7 has the expected value.

Once we can assert that PCR 7 has the expected value, we can assert that the system booted something signed by us and thus infer that the rest of the boot chain is also secure. But this is still dependent on the TPM obtaining trustworthy information, and unfortunately the bus that the TPM sits on isn't really terribly secure (TPM Genie is an example of an interposer for i2c-connected TPMs, but there's no reason an LPC one can't be constructed to attack the sort usually used on PCs). TPMs do support encrypted communication channels, but bootstrapping those isn't straightforward without firmware support. The easiest way around this is to make use of a firmware-based TPM, where the TPM is implemented in software running on an ancillary controller. Intel's solution is part of their Platform Trust Technology and runs on the Management Engine, AMD run it on the Platform Security Processor. In both cases it's not terribly feasible to intercept the communications, so we avoid this attack. The downside is that we're then placing more trust in components that are running much more code than a TPM would and which have a correspondingly larger attack surface. Which is preferable is going to depend on your threat model.

Most of this should be achievable using Yocto, which now has support for dm-verity built in. It's almost certainly going to be easier using this than trying to base on top of a general purpose distribution. I'd love to see this become a largely push button receive secure image process, so might take a go at that if I have some free time in the near future.

[1] Obviously technologies that can be used to ensure nobody other than me is able to modify the software on devices I own can also be used to ensure that nobody other than the manufacturer is able to modify the software on devices that they sell to third parties. There's no real technological solution to this problem, but we shouldn't allow the fact that a technology can be used in ways that are hostile to user freedom to cause us to reject that technology outright.
[2] This is slightly complicated due to the interactions with the Management Engine (on Intel) or the Platform Security Processor (on AMD). Here's a good writeup on the Intel side of things.

comment count unavailable comments

24 May 2021

Antoine Beaupr : Leaving Freenode

The freenode IRC network has been hijacked. TL;DR: move to libera.chat or OFTC.net, as did countless free software projects including Gentoo, CentOS, KDE, Wikipedia, FOSDEM, and more. Debian and the Tor project were already on OFTC and are not affected by this.

What is freenode and why should I care? freenode is the largest remaining IRC network. Before this incident, it had close to 80,000 users, which is small in terms of modern internet history -- even small social networks are larger by multiple orders of magnitude -- but is large in IRC history. The IRC network is also extensively used by the free software community, being the default IRC network on many programs, and used by hundreds if not thousands of free software projects. I have been using freenode since at least 2006. This matters if you care about IRC, the internet, open protocols, decentralisation, and, to a certain extent, federation as well. It also touches on who has the right on network resources: the people who "own" it (through money) or the people who make it work (through their labor). I am biased towards open protocols, the internet, federation, and worker power, and this might taint this analysis.

What happened? It's a long story, but basically:
  1. back in 2017, the former head of staff sold the freenode.net domain (and its related company) to Andrew Lee, "American entrepreneur, software developer and writer", and, rather weirdly, supposedly "crown prince of Korea" although that part is kind of complex (see House of Yi, Yi Won, and Yi Seok). It should be noted the Korean Empire hasn't existed for over a century at this point (even though its flag, also weirdly, remains)
  2. back then, this was only known to the public as this strange PIA and freenode joining forces gimmick. it was suspicious at first, but since the network kept running, no one paid much attention to it. opers of the network were similarly reassured that Lee would have no say in the management of the network
  3. this all changed recently when Lee asserted ownership of the freenode.net domain and started meddling in the operations of the network, according to this summary. this part is disputed, but it is corroborated by almost a dozen former staff which collectively resigned from the network in protest, after legal threats, when it was obvious freenode was lost.
  4. the departing freenode staff founded a new network, irc.libera.chat, based on the new ircd they were working on with OFTC, solanum
  5. meanwhile, bot armies started attacking all IRC networks: both libera and freenode, but also OFTC and unrelated networks like a small one I help operate. those attacks have mostly stopped as of this writing (2021-05-24 17:30UTC)
  6. on freenode, however, things are going for the worse: Lee has been accused of taking over a channel, in a grotesque abuse of power; then changing freenode policy to not only justify the abuse, but also remove rules against hateful speech, effectively allowing nazis on the network (update: the change was reverted, but not by Lee)
Update: even though the policy change was reverted, the actual conversations allowed on freenode have already degenerated into toxic garbage. There are also massive channel takeovers (presumably over 700), mostly on channels that were redirecting to libera, but also channels that were still live. Channels that were taken over include #fosdem, #wikipedia, #haskell... Instead of working on the network, the new "so-called freenode" staff is spending effort writing bots and patches to basically automate taking over channels. I run an IRC network and this bot is obviously not standard "services" stuff... This is just grotesque. At this point I agree with this HN comment:
We should stop implicitly legitimizing Andrew Lee's power grab by referring to his dominion as "Freenode". Freenode is a quarter-century-old community that has changed its name to libera.chat; the thing being referred to here as "Freenode" is something else that has illegitimately acquired control of Freenode's old servers and user database, causing enormous inconvenience to the real Freenode.
I don't agree with the suggested name there, let's instead call it "so called freenode" as suggested later in the thread.

What now? I recommend people and organisations move away from freenode as soon as possible. This is a major change: documentation needs to be fixed, and the migration needs to be coordinated. But I do not believe we can trust the new freenode "owners" to operate the network reliably and in good faith. It's also important to use the current momentum to build a critical mass elsewhere so that people don't end up on freenode again by default and find an even more toxic community than your typical run-of-the-mill free software project (which is already not a high bar to meet). Update: people are moving to libera in droves. It's now reaching 18,000 users, which is bigger than OFTC and getting close to the largest, traditionnal, IRC networks (EFnet, Undernet, IRCnet are in the 10-20k users range). so-called freenode is still larger, currently clocking 68,000 users, but that's a huge drop from the previous count which was 78,000 before the exodus began. We're even starting to see the effects of the migration on netsplit.de. Update 2: the isfreenodedeadyet.com site is updated more frequently than netsplit and shows tons more information. It shows 25k online users for libera and 61k for so-called freenode (down from ~78k), and the trend doesn't seem to be stopping for so-called freenode. There's also a list of 400+ channels that have moved out. Keep in mind that such migrations take effect over long periods of time.

Where do I move to? The first thing you should do is to figure out which tool to use for interactive user support. There are multiple alternatives, of course -- this is the internet after all -- but here is a short list of suggestions, in preferred priority order:
  1. irc.libera.chat
  2. irc.OFTC.net
  3. Matrix.org, which bridges with OFTC and (hopefully soon) with libera as well, modern IRC alternative
  4. XMPP/Jabber also still exists, if you're into that kind of stuff, but I don't think the "chat room" story is great there, at least not as good as Matrix
Basically, the decision tree is this:
  • if you want to stay on IRC:
    • if you are already on many OFTC channels and few freenode channels: move to OFTC
    • if you are more inclined to support the previous freenode staff: move to libera
    • if you care about matrix users (in the short term): move to OFTC
  • if you are ready to leave IRC:
    • if you want the latest and greatest: move to Matrix
    • if you like XML and already use XMPP: move to XMPP
Frankly, at this point, everyone should seriously consider moving to Matrix. The user story is great, the web is a first class user, it supports E2EE (although XMPP as well), and has a lot of momentum behind it. It even bridges with IRC well (which is not the case for XMPP) so if you're worried about problems like this happening again. (Indeed, I wouldn't be surprised if similar drama happens on OFTC or libera in the future. The history of IRC is full of such epic controversies, takeovers, sabotage, attacks, technical flamewars, and other silly things. I am not sure, but I suspect a federated model like Matrix might be more resilient to conflicts like this one.) Changing protocols might mean losing a bunch of users however: not everyone is ready to move to Matrix, for example. Graybeards like me have been using irssi for years, if not decades, and would take quite a bit of convincing to move elsewhere. I have mostly kept my channels on IRC, and moved either to OFTC or libera. In retrospect, I think I might have moved everything to OFTC if I would have thought about it more, because almost all of my channels are there. But I kind of expect a lot of the freenode community to move to libera, so I am keeping a socket open there anyways.

How do I move? The first thing you should do is to update documentation, websites, and source code to stop pointing at freenode altogether. This is what I did for feed2exec, for example. You need to let people know in the current channel as well, and possibly shutdown the channel on freenode. Since my channels are either small or empty, I took the radical approach of:
  • redirecting the channel to ##unavailable which is historically the way we show channels have moved to another network
  • make the channel invite-only (which effectively enforces the redirection)
  • kicking everyone out of the channel
  • kickban people who rejoin
  • set the topic to announce the change
In IRC speak, the following commands should do all this:
/msg ChanServ set #anarcat mlock +if ##unavailable
/msg ChanServ clear #anarcat users moving to irc.libera.chat
/msg ChanServ set #anarcat restricted on
/topic #anarcat this channel has moved to irc.libera.chat
If the channel is not registered, the following might work
/mode #anarcat +if ##unavailable
Then you can leave freenode altogether:
/disconnect Freenode unacceptable hijack, policy changes and takeovers. so long and thanks for all the fish.
Keep in mind that some people have been unable to setup such redirections, because the new freenode staff have taken over their channel, in which case you're out of luck... Some people have expressed concern about their private data hosted at freenode as well. If you care about this, you can always talk to NickServ and DROP your nick. Be warned, however, that this assumes good faith of the network operators, which, at this point, is kind of futile. I would assume any data you have registered on there (typically: your NickServ password and email address) to be compromised and leaked. If your password is used elsewhere (tsk, tsk), change it everywhere. Update: there's also another procedure, similar to the above, but with a different approach. Keep in mind that so-called freenode staff are actively hijacking channels for the mere act of mentioning libera in the channel topic, so thread carefully there.

Last words This is a sad time for IRC in general, and freenode in particular. It's a real shame that the previous freenode staff have been kicked out, and it's especially horrible that if the new policies of the network are basically making the network open to nazis. I wish things would have gone out differently: now we have yet another fork in the IRC history. While it's not the first time freenode changes name (it was called OPN before), now the old freenode is still around and this will bring much confusion to the world, especially since the new freenode staff is still claiming to support FOSS. I understand there are many sides to this story, and some people were deeply hurt by all this. But for me, it's completely unacceptable to keep pushing your staff so hard that they basically all (except one?) resign in protest. For me, that's leadership failure at the utmost, and a complete disgrace. And of course, I can't in good conscience support or join a network that allows hate speech. Regardless of the fate of whatever we'll call what's left of freenode, maybe it's time for this old IRC thing to die already. It's still a sad day in internet history, but then again, maybe IRC will never die...

6 May 2021

Matthew Garrett: More doorbell adventures

Back in my last post on this topic, I'd got shell on my doorbell but hadn't figured out why the HTTP callbacks weren't always firing. I still haven't, but I have learned some more things.

Doorbird sell a chime, a network connected device that is signalled by the doorbell when someone pushes a button. It costs about $150, which seems excessive, but would solve my problem (ie, that if someone pushes the doorbell and I'm not paying attention to my phone, I miss it entirely). But given a shell on the doorbell, how hard could it be to figure out how to mimic the behaviour of one?

Configuration for the doorbell is all stored under /mnt/flash, and there's a bunch of files prefixed 1000eyes that contain config (1000eyes is the German company that seems to be behind Doorbird). One of these was called 1000eyes.peripherals, which seemed like a good starting point. The initial contents were "Peripherals":[] , so it seemed likely that it was intended to be JSON. Unfortunately, since I had no access to any of the peripherals, I had no idea what the format was. I threw the main application into Ghidra and found a function that had debug statements referencing "initPeripherals and read a bunch of JSON keys out of the file, so I could simply look at the keys it referenced and write out a file based on that. I did so, and it didn't work - the app stubbornly refused to believe that there were any defined peripherals. The check that was failing was pcVar4 = strstr(local_50[0],PTR_s_"type":"_0007c980);, which made no sense, since I very definitely had a type key in there. And then I read it more closely. strstr() wasn't being asked to look for "type":, it was being asked to look for "type":". I'd left a space between the : and the opening " in the value, which meant it wasn't matching. The rest of the function seems to call an actual JSON parser, so I have no idea why it doesn't just use that for this part as well, but deleting the space and restarting the service meant it now believed I had a peripheral attached.

The mobile app that's used for configuring the doorbell now showed a device in the peripherals tab, but it had a weird corrupted name. Tapping it resulted in an error telling me that the device was unavailable, and on the doorbell itself generated a log message showing it was trying to reach a device with the hostname bha-04f0212c5cca and (unsurprisingly) failing. The hostname was being generated from the MAC address field in the peripherals file and was presumably supposed to be resolved using mDNS, but for now I just threw a static entry in /etc/hosts pointing at my Home Assistant device. That was enough to show that when I opened the app the doorbell was trying to call a CGI script called peripherals.cgi on my fake chime. When that failed, it called out to the cloud API to ask it to ask the chime[1] instead. Since the cloud was completely unaware of my fake device, this didn't work either. I hacked together a simple server using Python's HTTPServer and was able to return data (another block of JSON). This got me to the point where the app would now let me get to the chime config, but would then immediately exit. adb logcat showed a traceback in the app caused by a failed assertion due to a missing key in the JSON, so I ran the app through jadx, found the assertion and from there figured out what keys I needed. Once that was done, the app opened the config page just fine.

Unfortunately, though, I couldn't edit the config. Whenever I hit "save" the app would tell me that the peripheral wasn't responding. This was strange, since the doorbell wasn't even trying to hit my fake chime. It turned out that the app was making a CGI call to the doorbell, and the thread handling that call was segfaulting just after reading the peripheral config file. This suggested that the format of my JSON was probably wrong and that the doorbell was not handling that gracefully, but trying to figure out what the format should actually be didn't seem easy and none of my attempts improved things.

So, new approach. Rather than writing the config myself, why not let the doorbell do it? I should be able to use the genuine pairing process if I could mimic the chime sufficiently well. Hitting the "add" button in the app asked me for the username and password for the chime, so I typed in something random in the expected format (six characters followed by four zeroes) and a sufficiently long password and hit ok. A few seconds later it told me it couldn't find the device, which wasn't unexpected. What was a little more unexpected was that the log on the doorbell was showing it trying to hit another bha-prefixed hostname (and, obviously, failing). The hostname contains the MAC address, but I hadn't told the doorbell the MAC address of the chime, just its username. Some more digging showed that the doorbell was calling out to the cloud API, giving it the 6 character prefix from the username and getting a MAC address back. Doing the same myself revealed that there was a straightforward mapping from the prefix to the mac address - changing the final character from "a" to "b" incremented the MAC by one. It's actually just a base 26 encoding of the MAC, with aaaaaa translating to 00408C000000.

That explained how the hostname was being generated, and in return I was able to work backwards to figure out which username I should use to generate the hostname I was already using. Attempting to add it now resulted in the doorbell making another CGI call to my fake chime in order to query its feature set, and by mocking that up as well I was able to send back a file containing X-Intercom-Type, X-Intercom-TypeId and X-Intercom-Class fields that made the doorbell happy. I now had a valid JSON file, which cleared up a couple of mysteries. The corrupt name was because the name field isn't supposed to be ASCII - it's base64 encoded UTF16-BE. And the reason I hadn't been able to figure out the JSON format correctly was because it looked something like this:

"Peripherals":[] "prefix": "type":"DoorChime","name":"AEQAbwBvAHIAYwBoAGkAbQBlACAAVABlAHMAdA==","mac":"04f0212c5cca","user":"username","password":"password" ]


Note that there's a total of one [ in this file, but two ]s? Awesome. Anyway, I could now modify the config in the app and hit save, and the doorbell would then call out to my fake chime to push config to it. Weirdly, the association between the chime and a specific button on the doorbell is only stored on the chime, not on the doorbell. Further, hitting the doorbell didn't result in any more HTTP traffic to my fake chime. However, it did result in some broadcast UDP traffic being generated. Searching for the port number led me to the Doorbird LAN API and a complete description of the format and encryption mechanism in use. Argon2I is used to turn the first five characters of the chime's password (which is also stored on the doorbell itself) into a 256-bit key, and this is used with ChaCha20 to decrypt the payload. The payload then contains a six character field describing the device sending the event, and then another field describing the event itself. Some more scrappy Python and I could pick up these packets and decrypt them, which showed that they were being sent whenever any event occurred on the doorbell. This explained why there was no storage of the button/chime association on the doorbell itself - the doorbell sends packets for all events, and the chime is responsible for deciding whether to act on them or not.

On closer examination, it turns out that these packets aren't just sent if there's a configured chime. One is sent for each configured user, avoiding the need for a cloud round trip if your phone is on the same network as the doorbell at the time. There was literally no need for me to mimic the chime at all, suitable events were already being sent.

Still. There's a fair amount of WTFery here, ranging from the strstr() based JSON parsing, the invalid JSON, the symmetric encryption that uses device passwords as the key (requiring the doorbell to be aware of the chime's password) and the use of only the first five characters of the password as input to the KDF. It doesn't give me a great deal of confidence in the rest of the device's security, so I'm going to keep playing.

[1] This seems to be to handle the case where the chime isn't on the same network as the doorbell

comment count unavailable comments

23 April 2021

Matthew Garrett: An accidental bootsplash

Back in 2005 we had Debconf in Helsinki. Earlier in the year I'd ended up invited to Canonical's Ubuntu Down Under event in Sydney, and one of the things we'd tried to design was a reasonable graphical boot environment that could also display status messages. The design constraints were awkward - we wanted it to be entirely in userland (so we didn't need to carry kernel patches), and we didn't want to rely on vesafb[1] (because at the time we needed to reinitialise graphics hardware from userland on suspend/resume[2], and vesa was not super compatible with that). Nothing currently met our requirements, but by the time we'd got to Helsinki there was a general understanding that Paul Sladen was going to implement this.

The Helsinki Debconf ended being an extremely strange event, involving me having to explain to Mark Shuttleworth what the physics of a bomb exploding on a bus were, many people being traumatised by the whole sauna situation, and the whole unfortunate water balloon incident, but it also involved Sladen spending a bunch of time trying to produce an SVG of a London bus as a D-Bus logo and not really writing our hypothetical userland bootsplash program, so on the last night, fueled by Koff that we'd bought by just collecting all the discarded empty bottles and returning them for the deposits, I started writing one.

I knew that Debian was already using graphics mode for installation despite having a textual installer, because they needed to deal with more complex fonts than VGA could manage. Digging into the code, I found that it used BOGL - a graphics library that made use of the VGA framebuffer to draw things. VGA had a pre-allocated memory range for the framebuffer[3], which meant the firmware probably wouldn't map anything else there any hitting those addresses probably wouldn't break anything. This seemed safe.

A few hours later, I had some code that could use BOGL to print status messages to the screen of a machine booted with vga16fb. I woke up some time later, somehow found myself in an airport, and while sitting at the departure gate[4] I spent a while staring at VGA documentation and worked out which magical calls I needed to make to have it behave roughly like a linear framebuffer. Shortly before I got on my flight back to the UK, I had something that could also draw a graphical picture.

Usplash shipped shortly afterwards. We hit various issues - vga16fb produced a 640x480 mode, and some laptops were not inclined to do that without a BIOS call first. 640x400 worked basically everywhere, but meant we had to redraw the art because circles don't work the same way if you change the resolution. My brief "UBUNTU BETA" artwork that was me literally writing "UBUNTU BETA" on an HP TC1100 shortly after I'd got the Wacom screen working did not go down well, and thankfully we had better artwork before release.

But 16 colours is somewhat limiting. SVGALib offered a way to get more colours and better resolution in userland, retaining our prerequisites. Unfortunately it relied on VM86, which doesn't exist in 64-bit mode on Intel systems. I ended up hacking the X.org x86emu into a thunk library that exposed the same API as LRMI, so we could run it without needing VM86. Shockingly, it worked - we had support for 256 colour bootsplashes in any supported resolution on 64 bit systems as well as 32 bit ones.

But by now it was obvious that the future was having the kernel manage graphics support, both in terms of native programming and in supporting suspend/resume. Plymouth is much more fully featured than Usplash ever was, but relies on functionality that simply didn't exist when we started this adventure. There's certainly an argument that we'd have been better off making reasonable kernel modesetting support happen faster, but at this point I had literally no idea how to write decent kernel code and everyone should be happy I kept this to userland.

Anyway. The moral of all of this is that sometimes history works out such that you write some software that a huge number of people run without any idea of who you are, and also that this can happen without you having any fucking idea what you're doing.

Write code. Do crimes.

[1] vesafb relied on either the bootloader or the early stage kernel performing a VBE call to set a mode, and then just drawing directly into that framebuffer. When we were doing GPU reinitialisation in userland we couldn't guarantee that we'd run before the kernel tried to draw stuff into that framebuffer, and there was a risk that that was mapped to something dangerous if the GPU hadn't been reprogrammed into the same state. It turns out that having GPU modesetting in the kernel is a Good Thing.

[2] ACPI didn't guarantee that the firmware would reinitialise the graphics hardware, and as a result most machines didn't. At this point Linux didn't have native support for initialising most graphics hardware, so we fell back to doing it from userland. VBEtool was a terrible hack I wrote to try to re-execute the system's graphics hardware through a range of mechanisms, and it worked in a surprising number of cases.

[3] As long as you were willing to deal with 640x480 in 16 colours

[4] Helsinki-Vantaan had astonishingly comfortable seating for time

comment count unavailable comments

15 March 2021

Matthew Garrett: Exploring my doorbell

I've talked about my doorbell before, but started looking at it again this week because sometimes it simply doesn't send notifications to my Home Assistant setup - the push notifications appear on my phone, but the doorbell simply doesn't trigger the HTTP callback it's meant to[1]. This is obviously suboptimal, but it's also tricky to debug a device when you have no access to it.

Normally I'd just head straight in with a screwdriver, but the doorbell is shared with the other units in this building and it seemed a little anti-social to interfere with a shared resource. So I bought some broken units from ebay and pulled one of them apart. There's several boards inside, but one of them had a conveniently empty connector at the top with "TX", "RX" and "GND" labelled. Sticking a USB-serial converter on this gave me output from U-Boot, and then kernel output. Confirmation that my doorbell runs Linux, but unfortunately it didn't give me a shell prompt. My next approach would often me to just dump the flash and look for vulnerabilities that way, but this device uses TSOP-48 packaged NAND flash rather than the more convenient SPI NOR flash that I already have adapters to access. Dumping this sort of NAND isn't terribly hard, but the easiest way to do it involves desoldering it from the board and plugging it into something like a Flashcat USB adapter, and my soldering's not good enough to put it back on the board afterwards. So I wanted another approach.

U-Boot gave a short countdown to hit a key before continuing with boot, and for once hitting a key actually did something. Unfortunately it then prompted for a password, and giving the wrong one resulted in boot continuing[2]. In the past I've had good luck forcing U-Boot to drop to a prompt by simply connecting one of the data lines on SPI flash to ground while it's trying to read the kernel - the failed read causes U-Boot to error out. It turns out the same works fine on raw NAND, so I just edited the kernel boot arguments to append "init=/bin/sh" and soon I had a shell.

From here on, things were made easier by virtue of the device using the YAFFS filesystem. Unlike many flash filesystems, it's read/write, so I could make changes that would persist through to the running system. There was a convenient copy of telnetd included, but it segfaulted on startup, which reduced its usefulness. Fortunately there was also a copy of Netcat[3]. If you make a fifo somewhere on the filesystem, you can cat the fifo to a shell, pipe the shell to a netcat listener, and then pipe netcat's output back to the fifo. The shell's output all gets passed to whatever connects to netcat, and whatever's sent to netcat gets passed through the fifo back to the shell. This is, obviously, horribly insecure, but it was enough to get a root shell over the network on the running device.

The doorbell runs various bits of software, one of which is Lighttpd to provide a local API and access to the device. Another component ("nxp-client") connects to the vendor's cloud infrastructure and passes cloud commands back to the local webserver. This is where I found something strange. Lighttpd was refusing to start because its modules wanted library symbols that simply weren't present on the device. My best guess is that a firmware update went wrong and left the device in a partially upgraded state - and without a working local webserver, there was no way to perform any further updates. This may explain why this doorbell was sitting on ebay.

Anyway. Now that I had shell, I could simply dump the flash by copying it directly off the /dev/mtdblock devices - since I had netcat, I could just pipe stuff through that back to my actual computer. Now I had access to the filesystem I could extract that locally and start digging into it more deeply. One incredibly useful tool for this is qemu-user. qemu is a general purpose hardware emulation platform, usually used to emulate entire systems. But in qemu-user mode, it instead only emulates the CPU. When a piece of code tries to make a system call to access the kernel, qemu-user translates that to the appropriate calling convention for the host kernel and makes that call instead. Combined with binfmt_misc, you can configure a Linux system to be able to run Linux binaries from other architectures. One of the best things about this is that, because they're still using the host convention for making syscalls, you can run the host strace on them and see what they're doing.

What I found was that nxp-client was calling back to the cloud platform, setting up an encrypted communication channel (using ChaCha20 and a bunch of key setup stuff I couldn't be bothered picking apart) and then waiting for commands from the cloud. It would then proxy those through to the local webserver. Since I couldn't run the local lighttpd, I just wrote a trivial Python app using http.server and waited to see what requests I got. The first was a GET to a CGI script called editcgi.cgi, along with a path name. I mocked up the GET request to respond with what was on the actual filesystem. The cloud then proceeded to POST to editcgi.cgi, with the same pathname and with new file contents. editcgi.cgi is apparently able to read and write to files on the filesystem.

But this is on the interface that's exposed to the cloud client, so this didn't appear immediately useful - and, indeed, trying to hit the same CGI binary over the local network gave me a 401 unauthorized error. There's a local API spec for these doorbells, but they all refer to scripts in the bha-api namespace, and this script was in the plain cgi-bin namespace. But then I noticed that the bha-api namespace didn't actually exist in the filesystem - instead, lighttpd's mod_alias was configured to rewrite requests to bha-api through to files in cgi-bin. And by using the documented API to get a session token, I could call editcgi.cgi to read and write arbitrary files on the doorbell. Which means I can drop an extra script in /etc/rc.d/rc3.d and get a shell on my doorbell.

This all requires the ability to have local authentication credentials, so it's not a big security deal other than it allowing you to retain access to a monitoring device even after you've moved out and had your credentials revoked. I'm sure it's all fine.

[1] I can ping the doorbell from the Home Assistant machine, so it's not that the network is flaky
[2] The password appears to be hy9$gnhw0z6@ if anyone else ends up in this situation
[3] https://twitter.com/mjg59/status/654578208545751040

comment count unavailable comments

9 March 2021

Matthew Garrett: Unauthenticated MQTT endpoints on Linksys Velop routers enable local DoS

(Edit: this is CVE-2021-1000002)

Linksys produces a series of wifi mesh routers under the Velop line. These routers use MQTT to send messages to each other for coordination purposes. In the version I tested against, there was zero authentication on this - anyone on the local network is able to connect to the MQTT interface on a router and send commands. As an example:
mosquitto_pub -h 192.168.1.1 -t "network/master/cmd/nodes_temporary_blacklist" -m ' "data": "client": "f8:16:54:43:e2:0c", "duration": "3600", "action": "start" '
will ask the router to block the client with MAC address f8:16:54:43:e2:0c from the network for an hour. Various other MQTT topics pass parameters to shell scripts without quoting them or escaping metacharacters, so more serious outcomes may be possible.

The vendor has released two firmware updates since report - I have not verified whether either fixes this, but the changelog does not indicate any security issues were addressed.

Timeline:

2020-07-30: Submitted through the vendor's security vulnerability report form, indicating that I plan to disclose in either 90 days or after a fix is released. The form turns out to file a Bugcrowd submission.
2020-07-30: I claim the Bugcrowd submission.
2020-08-19: Vendor acknowledges the issue, is able to reproduce and assigns it a P3 priority.
2020-12-15: I ask if there's an update.
2021-02-02: I ask if there's an update.
2021-02-03: Bugcrowd raise a blocker on the issue, asking the vendor to respond.
2021-02-17: I ask for permission to disclose.
2021-03-09: In the absence of any response from the vendor since 2020-08-19, I violate Bugcrowd disclosure policies and unilaterally disclose.

comment count unavailable comments

21 February 2021

Matthew Garrett: Making hibernation work under Linux Lockdown

Linux draws a distinction between code running in kernel (kernel space) and applications running in userland (user space). This is enforced at the hardware level - in x86-speak[1], kernel space code runs in ring 0 and user space code runs in ring 3[2]. If you're running in ring 3 and you attempt to touch memory that's only accessible in ring 0, the hardware will raise a fault. No matter how privileged your ring 3 code, you don't get to touch ring 0.

Kind of. In theory. Traditionally this wasn't well enforced. At the most basic level, since root can load kernel modules, you could just build a kernel module that performed any kernel modifications you wanted and then have root load it. Technically user space code wasn't modifying kernel space code, but the difference was pretty semantic rather than useful. But it got worse - root could also map memory ranges belonging to PCI devices[3], and if the device could perform DMA you could just ask the device to overwrite bits of the kernel[4]. Or root could modify special CPU registers ("Model Specific Registers", or MSRs) that alter CPU behaviour via the /dev/msr interface, and compromise the kernel boundary that way.

It turns out that there were a number of ways root was effectively equivalent to ring 0, and the boundary was more about reliability (ie, a process running as root that ends up misbehaving should still only be able to crash itself rather than taking down the kernel with it) than security. After all, if you were root you could just replace the on-disk kernel with a backdoored one and reboot. Going deeper, you could replace the bootloader with one that automatically injected backdoors into a legitimate kernel image. We didn't have any way to prevent this sort of thing, so attempting to harden the root/kernel boundary wasn't especially interesting.

In 2012 Microsoft started requiring vendors ship systems with UEFI Secure Boot, a firmware feature that allowed[5] systems to refuse to boot anything without an appropriate signature. This not only enabled the creation of a system that drew a strong boundary between root and kernel, it arguably required one - what's the point of restricting what the firmware will stick in ring 0 if root can just throw more code in there afterwards? What ended up as the Lockdown Linux Security Module provides the tooling for this, blocking userspace interfaces that can be used to modify the kernel and enforcing that any modules have a trusted signature.

But that comes at something of a cost. Most of the features that Lockdown blocks are fairly niche, so the direct impact of having it enabled is small. Except that it also blocks hibernation[6], and it turns out some people were using that. The obvious question is "what does hibernation have to do with keeping root out of kernel space", and the answer is a little convoluted and is tied into how Linux implements hibernation. Basically, Linux saves system state into the swap partition and modifies the header to indicate that there's a hibernation image there instead of swap. On the next boot, the kernel sees the header indicating that it's a hibernation image, copies the contents of the swap partition back into RAM, and then jumps back into the old kernel code. What ensures that the hibernation image was actually written out by the kernel? Absolutely nothing, which means a motivated attacker with root access could turn off swap, write a hibernation image to the swap partition themselves, and then reboot. The kernel would happily resume into the attacker's image, giving the attacker control over what gets copied back into kernel space.

This is annoying, because normally when we think about attacks on swap we mitigate it by requiring an encrypted swap partition. But in this case, our attacker is root, and so already has access to the plaintext version of the swap partition. Disk encryption doesn't save us here. We need some way to verify that the hibernation image was written out by the kernel, not by root. And thankfully we have some tools for that.

Trusted Platform Modules (TPMs) are cryptographic coprocessors[7] capable of doing things like generating encryption keys and then encrypting things with them. You can ask a TPM to encrypt something with a key that's tied to that specific TPM - the OS has no access to the decryption key, and nor does any other TPM. So we can have the kernel generate an encryption key, encrypt part of the hibernation image with it, and then have the TPM encrypt it. We store the encrypted copy of the key in the hibernation image as well. On resume, the kernel reads the encrypted copy of the key, passes it to the TPM, gets the decrypted copy back and is able to verify the hibernation image.

That's great! Except root can do exactly the same thing. This tells us the hibernation image was generated on this machine, but doesn't tell us that it was done by the kernel. We need some way to be able to differentiate between keys that were generated in kernel and ones that were generated in userland. TPMs have the concept of "localities" (effectively privilege levels) that would be perfect for this. Userland is only able to access locality 0, so the kernel could simply use locality 1 to encrypt the key. Unfortunately, despite trying pretty hard, I've been unable to get localities to work. The motherboard chipset on my test machines simply doesn't forward any accesses to the TPM unless they're for locality 0. I needed another approach.

TPMs have a set of Platform Configuration Registers (PCRs), intended for keeping a record of system state. The OS isn't able to modify the PCRs directly. Instead, the OS provides a cryptographic hash of some material to the TPM. The TPM takes the existing PCR value, appends the new hash to that, and then stores the hash of the combination in the PCR - a process called "extension". This means that the new value of the TPM depends not only on the value of the new data, it depends on the previous value of the PCR - and, in turn, that previous value depended on its previous value, and so on. The only way to get to a specific PCR value is to either (a) break the hash algorithm, or (b) perform exactly the same sequence of writes. On system reset the PCRs go back to a known value, and the entire process starts again.

Some PCRs are different. PCR 23, for example, can be reset back to its original value without resetting the system. We can make use of that. The first thing we need to do is to prevent userland from being able to reset or extend PCR 23 itself. All TPM accesses go through the kernel, so this is a simple matter of parsing the write before it's sent to the TPM and returning an error if it's a sensitive command that would touch PCR 23. We now know that any change in PCR 23's state will be restricted to the kernel.

When we encrypt material with the TPM, we can ask it to record the PCR state. This is given back to us as metadata accompanying the encrypted secret. Along with the metadata is an additional signature created by the TPM, which can be used to prove that the metadata is both legitimate and associated with this specific encrypted data. In our case, that means we know what the value of PCR 23 was when we encrypted the key. That means that if we simply extend PCR 23 with a known value in-kernel before encrypting our key, we can look at the value of PCR 23 in the metadata. If it matches, the key was encrypted by the kernel - userland can create its own key, but it has no way to extend PCR 23 to the appropriate value first. We now know that the key was generated by the kernel.

But what if the attacker is able to gain access to the encrypted key? Let's say a kernel bug is hit that prevents hibernation from resuming, and you boot back up without wiping the hibernation image. Root can then read the key from the partition, ask the TPM to decrypt it, and then use that to create a new hibernation image. We probably want to prevent that as well. Fortunately, when you ask the TPM to encrypt something, you can ask that the TPM only decrypt it if the PCRs have specific values. "Sealing" material to the TPM in this way allows you to block decryption if the system isn't in the desired state. So, we define a policy that says that PCR 23 must have the same value at resume as it did on hibernation. On resume, the kernel resets PCR 23, extends it to the same value it did during hibernation, and then attempts to decrypt the key. Afterwards, it resets PCR 23 back to the initial value. Even if an attacker gains access to the encrypted copy of the key, the TPM will refuse to decrypt it.

And that's what this patchset implements. There's one fairly significant flaw at the moment, which is simply that an attacker can just reboot into an older kernel that doesn't implement the PCR 23 blocking and set up state by hand. Fortunately, this can be avoided using another aspect of the boot process. When you boot something via UEFI Secure Boot, the signing key used to verify the booted code is measured into PCR 7 by the system firmware. In the Linux world, the Shim bootloader then measures any additional keys that are used. By either using a new key to tag kernels that have support for the PCR 23 restrictions, or by embedding some additional metadata in the kernel that indicates the presence of this feature and measuring that, we can have a PCR 7 value that verifies that the PCR 23 restrictions are present. We then seal the key to PCR 7 as well as PCR 23, and if an attacker boots into a kernel that doesn't have this feature the PCR 7 value will be different and the TPM will refuse to decrypt the secret.

While there's a whole bunch of complexity here, the process should be entirely transparent to the user. The current implementation requires a TPM 2, and I'm not certain whether TPM 1.2 provides all the features necessary to do this properly - if so, extending it shouldn't be hard, but also all systems shipped in the past few years should have a TPM 2, so that's going to depend on whether there's sufficient interest to justify the work. And we're also at the early days of review, so there's always the risk that I've missed something obvious and there are terrible holes in this. And, well, given that it took almost 8 years to get the Lockdown patchset into mainline, let's not assume that I'm good at landing security code.

[1] Other architectures use different terminology here, such as "supervisor" and "user" mode, but it's broadly equivalent
[2] In theory rings 1 and 2 would allow you to run drivers with privileges somewhere between full kernel access and userland applications, but in reality we just don't talk about them in polite company
[3] This is how graphics worked in Linux before kernel modesetting turned up. XFree86 would just map your GPU's registers into userland and poke them directly. This was not a huge win for stability
[4] IOMMUs can help you here, by restricting the memory PCI devices can DMA to or from. The kernel then gets to allocate ranges for device buffers and configure the IOMMU such that the device can't DMA to anything else. Except that region of memory may still contain sensitive material such as function pointers, and attacks like this can still cause you problems as a result.
[5] This describes why I'm using "allowed" rather than "required" here
[6] Saving the system state to disk and powering down the platform entirely - significantly slower than suspending the system while keeping state in RAM, but also resilient against the system losing power.
[7] With some handwaving around "coprocessor". TPMs can't be part of the OS or the system firmware, but they don't technically need to be an independent component. Intel have a TPM implementation that runs on the Management Engine, a separate processor built into the motherboard chipset. AMD have one that runs on the Platform Security Processor, a small ARM core built into their CPU. Various ARM implementations run a TPM in Trustzone, a special CPU mode that (in theory) is able to access resources that are entirely blocked off from anything running in the OS, kernel or otherwise.

comment count unavailable comments

27 July 2020

Matthew Garrett: Filesystem deduplication is a sidechannel

First off - nothing I'm going to talk about in this post is novel or overly surprising, I just haven't found a clear writeup of it before. I'm not criticising any design decisions or claiming this is an important issue, just raising something that people might otherwise be unaware of.

With that out of the way: Automatic deduplication of data is a feature of modern filesystems like zfs and btrfs. It takes two forms - inline, where the filesystem detects that data being written to disk is identical to data that already exists on disk and simply references the existing copy rather than, and offline, where tooling retroactively identifies duplicated data and removes the duplicate copies (zfs supports inline deduplication, btrfs only currently supports offline). In a world where disks end up with multiple copies of cloud or container images, deduplication can free up significant amounts of disk space.

What's the security implication? The problem is that deduplication doesn't recognise ownership - if two users have copies of the same file, only one copy of the file will be stored[1]. So, if user a stores a file, the amount of free space will decrease. If user b stores another copy of the same file, the amount of free space will remain the same. If user b is able to check how much free space is available, user b can determine whether the file already exists.

This doesn't seem like a huge deal in most cases, but it is a violation of expected behaviour (if user b doesn't have permission to read user a's files, user b shouldn't be able to determine whether user a has a specific file). But we can come up with some convoluted cases where it becomes more relevant, such as law enforcement gaining unprivileged access to a system and then being able to demonstrate that a specific file already exists on that system. Perhaps more interestingly, it's been demonstrated that free space isn't the only sidechannel exposed by deduplication - deduplication has an impact on access timing, and can be used to infer the existence of data across virtual machine boundaries.

As I said, this is almost certainly not something that matters in most real world scenarios. But with so much discussion of CPU sidechannels over the past couple of years, it's interesting to think about what other features also end up leaking information in ways that may not be obvious.

(Edit to add: deduplication isn't enabled on zfs by default and is explicitly triggered on btrfs, so unless it's something you've enabled then this isn't something that affects you)

[1] Deduplication is usually done at the block level rather than the file level, but given zfs's support for variable sized blocks, identical files should be deduplicated even if they're smaller than the maximum record size

comment count unavailable comments

24 June 2020

Matthew Garrett: Making my doorbell work

I recently moved house, and the new building has a Doorbird to act as a doorbell and open the entrance gate for people. There's a documented local control API (no cloud dependency!) and a Home Assistant integration, so this seemed pretty straightforward.

Unfortunately not. The Doorbird is on separate network that's shared across the building, provided by Monkeybrains. We're also a Monkeybrains customer, so our network connection is plugged into the same router and antenna as the Doorbird one. And, as is common, there's port isolation between the networks in order to avoid leakage of information between customers. Rather perversely, we are the only people with an internet connection who are unable to ping my doorbell.

I spent most of the past few weeks digging myself out from under a pile of boxes, but we'd finally reached the point where spending some time figuring out a solution to this seemed reasonable. I spent a while playing with port forwarding, but that wasn't ideal - the only server I run is in the UK, and having packets round trip almost 11,000 miles so I could speak to something a few metres away seemed like a bad plan. Then I tried tethering an old Android device with a data-only SIM, which worked fine but only in one direction (I could see what the doorbell could see, but I couldn't get notifications that someone had pushed a button, which was kind of the point here).

So I went with the obvious solution - I added a wifi access point to the doorbell network, and my home automation machine now exists on two networks simultaneously (nmcli device modify wlan0 ipv4.never-default true is the magic for "ignore the gateway that the DHCP server gives you" if you want to avoid this), and I could now do link local service discovery to find the doorbell if it changed addresses after a power cut or anything. And then, like magic, everything worked - I got notifications from the doorbell when someone hit our button.

But knowing that an event occurred without actually doing something in response seems fairly unhelpful. I have a bunch of Chromecast targets around the house (a mixture of Google Home devices and Chromecast Audios), so just pushing a message to them seemed like the easiest approach. Home Assistant has a text to speech integration that can call out to various services to turn some text into a sample, and then push that to a media player on the local network. You can group multiple Chromecast audio sinks into a group that then presents as a separate device on the network, so I could then write an automation to push audio to the speaker group in response to the button being pressed.

That's nice, but it'd also be nice to do something in response. The Doorbird exposes API control of the gate latch, and Home Assistant exposes that as a switch. I'm using Home Assistant's Google Assistant integration to expose devices Home Assistant knows about to voice control. Which means when I get a house-wide notification that someone's at the door I can just ask Google to open the door for them.

So. Someone pushes the doorbell. That sends a signal to a machine that's bridged onto that network via an access point. That machine then sends a protobuf command to speakers on a separate network, asking them to stream a sample it's providing. Those speakers call back to that machine, grab the sample and play it. At this point, multiple speakers in the house say "Someone is at the door". I then say "Hey Google, activate the front gate" - the device I'm closest to picks this up and sends it to Google, where something turns my speech back into text. It then looks at my home structure data and realises that the "Front Gate" device is associated with my Home Assistant integration. It then calls out to the home automation machine that received the notification in the first place, asking it to trigger the front gate relay. That device calls out to the Doorbird and asks it to open the gate. And now I have functionality equivalent to a doorbell that completes a circuit and rings a bell inside my home, and a button inside my home that completes a circuit and opens the gate, except it involves two networks inside my building, callouts to the cloud, at least 7 devices inside my home that are running Linux and I really don't want to know how many computational cycles.

The future is wonderful.

(I work for Google. I do not work on any of the products described in this post. Please god do not ask me how to integrate your IoT into any of this)

comment count unavailable comments

Next.