Search Results: "matthew"

9 April 2024

Matthew Palmer: How I Tripped Over the Debian Weak Keys Vulnerability

Those of you who haven't been in IT for far, far too long might not know that next month will be the 16th(!) anniversary of the disclosure of what was, at the time, a fairly earth-shattering revelation: that for about 18 months, the Debian OpenSSL package was generating entirely predictable private keys. The recent xz-stential threat (thanks to @nixCraft for making me aware of that one) has got me thinking about my own serendipitous interaction with a major vulnerability. Given that the statute of limitations has (probably) run out, I thought I'd share it as a tale of how "huh, that's weird" can be a powerful threat-hunting tool, but only if you've got the time to keep pulling at the thread.

Prelude to an Adventure
Our story begins back in March 2008. I was working at Engine Yard (EY), a now largely-forgotten Rails-focused hosting company, which pioneered several advances in Rails application deployment. Probably EY's greatest claim to lasting fame is that they helped launch a little code hosting platform you might have heard of, by providing them free infrastructure when they were little more than a glimmer in the Internet's eye. I am, of course, talking about everyone's favourite Microsoft product: GitHub.

Since GitHub was in the right place, at the right time, with a compelling product offering, they quickly started to gain traction and grow their userbase. With growth comes challenges, amongst them the one we're focusing on today: SSH login times. Then, as now, GitHub provided SSH access to the git repos they hosted, by SSHing to git@github.com with publickey authentication. They were using the standard way that everyone manages SSH keys: the ~/.ssh/authorized_keys file, and that became a problem as the number of keys started to grow. The way that SSH uses this file is that, when a user connects and asks for publickey authentication, SSH opens the ~/.ssh/authorized_keys file and scans all of the keys listed in it, looking for a key which matches the key that the user presented. This linear search is normally not a huge problem, because nobody in their right mind puts more than a few keys in their ~/.ssh/authorized_keys, right?
[Image: 2008-era GitHub giving monkey puppet side-eye to the idea that nobody stores many keys in an authorized_keys file.]
Of course, as a popular, rapidly-growing service, GitHub was gaining users at a fair clip, to the point that the one big file that stored all the SSH keys was starting to visibly impact SSH login times. This problem was also not going to get any better by itself. Something Had To Be Done.

EY management was keen on making sure GitHub ran well, and so despite it not really being a hosting problem, they were willing to help fix it. For some reason, the late, great Ezra Zygmuntowicz pointed GitHub in my direction, and let me take the time to really get into the problem with the GitHub team. After examining a variety of different possible solutions, we came to the conclusion that the least-worst option was to patch OpenSSH to look up keys in a MySQL database, indexed on the key fingerprint.

We didn't take this decision on a whim; it wasn't a case of "yeah, sure, let's just hack around with OpenSSH, what could possibly go wrong?". We knew it was potentially catastrophic if things went sideways, so you can imagine how much worse the other options available were. Ensuring that this wouldn't compromise security was a lot of the effort that went into the change. In the end, though, we rolled it out in early April, and lo! SSH logins were fast, and we were pretty sure we wouldn't have to worry about this problem for a long time to come. Normally, you'd think patching OpenSSH to make mass SSH logins super fast would be a good story on its own. But no, this is just the opening scene.
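
As an aside, the general shape of that fingerprint lookup is easy to sketch. The snippet below is not the patch we wrote (that replaced the authorized_keys scan inside OpenSSH itself, backed by MySQL); it's a minimal illustration, using Python, SQLite, and a hypothetical keys table, of looking a key up by its fingerprint instead of scanning one giant flat file. Modern OpenSSH can get the same effect without patching anything, by pointing AuthorizedKeysCommand at a helper like this:

#!/usr/bin/env python3
# Hypothetical AuthorizedKeysCommand helper: given a key fingerprint, return
# the matching authorized_keys line(s), instead of having sshd scan one
# enormous flat file. Wired up in sshd_config with something like:
#   AuthorizedKeysCommand /usr/local/bin/lookup-key %f
#   AuthorizedKeysCommandUser nobody
# The "keys" table and its schema are illustrative, not GitHub's actual setup.
import sqlite3
import sys

def lookup(fingerprint: str) -> list[str]:
    db = sqlite3.connect("/var/lib/sshkeys/keys.db")
    # An index on the fingerprint column makes this O(log n) rather than the
    # O(n) scan sshd does over ~/.ssh/authorized_keys.
    rows = db.execute(
        "SELECT options, key_type, key_base64 FROM keys WHERE fingerprint = ?",
        (fingerprint,),
    ).fetchall()
    return [f"{opts} {ktype} {key}".strip() for opts, ktype, key in rows]

if __name__ == "__main__":
    for line in lookup(sys.argv[1]):
        print(line)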

Chekhov's Gun Makes its Appearance
Fast forward a little under a month, to the first few days of May 2008. I get a message from one of the GitHub team, saying that somehow users were able to access other users' repos over SSH. Naturally, as we'd recently rolled out the OpenSSH patch, which touched this very thing, the code I'd written was suspect number one, so I was called in to help.
[Image: the lineup scene from the movie The Usual Suspects. They're called The Usual Suspects for a reason, but sometimes, it really is Keyser Söze.]
Eventually, after more than a little debugging, we discovered that, somehow, there were two users with keys that had the same key fingerprint. This absolutely shouldn't happen; it's a bit like winning the lottery twice in a row[1], unless the users had somehow shared their keys with each other, of course. Still, it was worth investigating, just in case it was a web application bug, so the GitHub team reached out to the users impacted, to try and figure out what was going on. The users professed no knowledge of each other, neither admitted to publicising their key, and couldn't offer any explanation as to how the other person could possibly have gotten their key.

Then things went from "weird" to "what the...?". Because another pair of users showed up, sharing a key fingerprint, but it was a different shared key fingerprint. The odds have now gone from "winning the lottery multiple times in a row" to as close to "this literally cannot happen" as makes no difference.
[Image: Milhouse from The Simpsons declaring "We're through the looking glass here, people".]
Once we were really, really confident that the OpenSSH patch wasn't the cause of the problem, my involvement in the problem basically ended. I wasn't a GitHub employee, and EY had plenty of other customers who needed my help, so I wasn't able to stay deeply involved in the ongoing investigation of The Mystery of the Duplicate Keys. However, the GitHub team did keep talking to the users involved, and managed to determine that the only apparent common factor was that all the users claimed to be using Debian or Ubuntu systems, which was where their SSH keys would have been generated. That was as far as the investigation had really gotten, when along came May 13, 2008.

Chekhov's Gun Goes Off
With the publication of DSA-1571-1, everything suddenly became clear. Through a well-meaning but ultimately disastrous cleanup of OpenSSL's randomness generation code, the Debian maintainer had inadvertently reduced the number of possible keys that could be generated by a given user from bazillions to a little over 32,000. With so many people signing up to GitHub, some of them no doubt following best practice and freshly generating a separate key, it's unsurprising that some collisions occurred. You can imagine the sense of "oooooooh, so that's what's going on!" that rippled out once the issue was understood. I was mostly glad that we had conclusive evidence that my OpenSSH patch wasn't at fault, little knowing how much more contact I was to have with Debian weak keys in the future, running a huge store of known-compromised keys and using them to find misbehaving Certificate Authorities, amongst other things.
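
The remediation tooling that shipped alongside the fix (ssh-vulnkey, and the openssh-blacklist packages) boiled down to a membership check against the full set of possible weak keys. A minimal sketch of that kind of check, assuming a hypothetical blocklist file of OpenSSH-style SHA256 fingerprints rather than the exact format those tools used:

# A sketch of checking a public key against a list of known-weak fingerprints.
# The blocklist file is hypothetical: one OpenSSH-style "SHA256:..." fingerprint
# per line. The 2008-era tools used their own formats, but the principle is the
# same: spotting a weak key is just a set lookup away.
import base64
import hashlib

def openssh_sha256_fingerprint(pubkey_line: str) -> str:
    # A public key line looks like: "<type> <base64 blob> [comment]"
    blob = base64.b64decode(pubkey_line.split()[1])
    digest = hashlib.sha256(blob).digest()
    return "SHA256:" + base64.b64encode(digest).decode().rstrip("=")

def is_known_weak(pubkey_line: str, blocklist_path: str = "weak-fingerprints.txt") -> bool:
    with open(blocklist_path) as f:
        weak = {line.strip() for line in f if line.strip()}
    return openssh_sha256_fingerprint(pubkey_line) in weak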

Lessons Learned
While I've not found a description of exactly when and how Luciano Bello discovered the vulnerability that became CVE-2008-0166, I presume he first came across it some time before it was disclosed, likely before GitHub tripped over it. The stable Debian release that included the vulnerable code had been released a year earlier, so there was plenty of time for Luciano to have discovered key collisions and gone "hmm, I wonder what's going on here?", then kept digging until the solution presented itself. The thought "hmm, that's odd", followed by intense investigation, leading to the discovery of a major flaw is also what ultimately brought down the recent XZ backdoor. The critical part of that sequence is the ability to do that intense investigation, though.

When I reflect on my brush with the Debian weak keys vulnerability, what sticks out to me is the fact that I didn't do the deep investigation. I wonder, if Luciano hadn't found it, how long it might have been before it was found. The GitHub team would have continued investigating, presumably, and perhaps they (or I) would have eventually dug deep enough to find it. But we were all super busy: myself working support tickets at EY, and GitHub feverishly building features and fighting the fires in their rapidly-growing service. As it was, Luciano was able to take the time to dig in and find out what was happening, but just like the XZ backdoor, I feel like we, as an industry, got a bit lucky that someone with the skills, time, and energy was on hand at the right time to make a huge difference.

It's a luxury to be able to take the time to really dig into a problem, and it's a luxury that most of us rarely have. Perhaps an understated takeaway is that somehow we all need to wrestle back some time to follow our hunches and really dig into the things that make us go "hmm".

Support My Hunches
If you'd like to help me be able to do intense investigations of mysterious software phenomena, you can shout me a refreshing beverage on ko-fi.
  1. the odds are actually probably more like winning the lottery about twenty times in a row. The numbers involved are staggeringly huge, so it's easiest to just approximate it as "really, really unlikely".

14 March 2024

Matthew Garrett: Digital forgeries are hard

Closing arguments in the trial between various people and Craig Wright over whether he's Satoshi Nakamoto are wrapping up today, amongst a bewildering array of presented evidence. But one utterly astonishing aspect of this lawsuit is that expert witnesses for both sides agreed that much of the digital evidence provided by Craig Wright was unreliable in one way or another, generally including indications that it wasn't produced at the point in time it claimed to be. And it's fascinating reading through the subtle (and, in some cases, not so subtle) ways that that's revealed.

One of the pieces of evidence entered is screenshots of data from Mind Your Own Business, a business management product that's been around for some time. Craig Wright relied on screenshots of various entries from this product to support his claims around having controlled a meaningful number of bitcoin before he was publicly linked to being Satoshi. If these were authentic then they'd be strong evidence linking him to the mining of coins before Bitcoin's public availability. Unfortunately the screenshots themselves weren't contemporary - the metadata shows them being created in 2020. This wouldn't fundamentally be a problem (it's entirely reasonable to create new screenshots of old material), as long as it's possible to establish that the material shown in the screenshots was created at that point. Sadly, well.

One part of the disclosed information was an email that contained a zip file that contained a raw database in the format used by MYOB. Importing that into the tool allowed an audit record to be extracted - this record showed that the relevant entries had been added to the database in 2020, shortly before the screenshots were created. This was, obviously, not strong evidence that Craig had held Bitcoin in 2009. This evidence was reported, and was responded to with a couple of additional databases that had an audit trail that was consistent with the dates in the records in question. Well, partially. The audit record included session data, showing an administrator logging into the database in 2011 and then, uh, logging out in 2023, which is rather more consistent with someone changing their system clock to 2011 to create an entry, and switching it back to present day before logging out. In addition, the audit log included fields that didn't exist in versions of the product released before 2016, strongly suggesting that the entries dated 2009-2011 were created in software released after 2016. And even worse, the order of insertions into the database didn't line up with calendar time - an entry dated before another entry may appear in the database afterwards, indicating that it was created later. But even more obvious? The database schema used for these old entries corresponded to a version of the software released in 2023.

This is all consistent with the idea that these records were created after the fact and backdated to 2009-2011, and that after this evidence was made available further evidence was created and backdated to obfuscate that. In an unusual turn of events, during the trial Craig Wright introduced further evidence in the form of a chain of emails to his former lawyers that indicated he had provided them with login details to his MYOB instance in 2019 - before the metadata associated with the screenshots. The implication isn't entirely clear, but it suggests that either they had an opportunity to examine this data before the metadata suggests it was created, or that they faked the data? So, well, the obvious thing happened, and his former lawyers were asked whether they received these emails. The chain consisted of three emails, two of which they confirmed they'd received. And they received a third email in the chain, but it was different to the one entered in evidence. And, uh, weirdly, they'd received a copy of the email that was submitted - but they'd received it a few days earlier. In 2024.

And again, the forensic evidence is helpful here! It turns out that the email client used associates a timestamp with any attachments, which in this case included an image in the email footer - and the mysterious time travelling email had a timestamp in 2024, not 2019. This was created by the client, so was consistent with the email having been sent in 2024, not being sent in 2019 and somehow getting stuck somewhere before delivery. The date header indicates 2019, as do encoded timestamps in the MIME headers - consistent with the mail being sent by a computer with the clock set to 2019.

But there's a very weird difference between the copy of the email that was submitted in evidence and the copy that was located afterwards! The former included a header inserted by gmail containing a 2019 timestamp, while the latter had a 2024 timestamp. Is there a way to determine which of these could be the truth? It turns out there is! The format of that header changed in 2022, and the version in the email is the new version. The version with the 2019 timestamp is anachronistic - the format simply doesn't match the header that gmail would have introduced in 2019, suggesting that an email sent in 2022 or later was modified to include a timestamp of 2019.

This is by no means the only indication that Craig Wright's evidence may be misleading (there's the whole argument that the Bitcoin white paper was written in LaTeX when general consensus is that it's written in OpenOffice, given that's what the metadata claims), but it's a lovely example of a more general issue.

Our technology chains are complicated. So many moving parts end up influencing the content of the data we generate, and those parts develop over time. It's fantastically difficult to generate an artifact now that precisely corresponds to how it would look in the past, even if we go to the effort of installing an old OS on an old PC and setting the clock appropriately (are you sure you're going to be able to mimic an entirely period appropriate patch level?). Even the version of the font you use in a document may indicate it's anachronistic. I'm pretty good at computers and I no longer have any belief I could fake an old document.

(References: this Dropbox, under "Expert reports", "Patrick Madden". Initial MYOB data is in "Appendix PM7", further analysis is in "Appendix PM42", email analysis is "Sixth Expert Report of Mr Patrick Madden")


19 February 2024

Matthew Garrett: Debugging an odd inability to stream video

We have a cabin out in the forest, and when I say "out in the forest" I mean "in a national forest subject to regulation by the US Forest Service" which means there's an extremely thick book describing the things we're allowed to do and (somewhat longer) not allowed to do. It's also down in the bottom of a valley surrounded by tall trees (the whole "forest" bit). There used to be AT&T copper but all that infrastructure burned down in a big fire back in 2021 and AT&T no longer supply new copper links, and Starlink isn't viable because of the whole "bottom of a valley surrounded by tall trees" thing along with regulations that prohibit us from putting up a big pole with a dish on top. Thankfully there's LTE towers nearby, so I'm simply using cellular data. Unfortunately my provider rate limits connections to video streaming services in order to push them down to roughly SD resolution. The easy workaround is just to VPN back to somewhere else, which in my case is just a Wireguard link back to San Francisco.

This worked perfectly for most things, but some streaming services simply wouldn't work at all. Attempting to load the video would just spin forever. Running tcpdump at the local end of the VPN endpoint showed a connection being established, some packets being exchanged, and then nothing. The remote service appeared to just stop sending packets. Tcpdumping the remote end of the VPN showed the same thing. It wasn't until I looked at the traffic on the VPN endpoint's external interface that things began to become clear.

This probably needs some background. Most network infrastructure has a maximum allowable packet size, which is referred to as the Maximum Transmission Unit or MTU. For ethernet this defaults to 1500 bytes, and these days most links are able to handle packets of at least this size, so it's pretty typical to just assume that you'll be able to send a 1500 byte packet. But what's important to remember is that that doesn't mean you have 1500 bytes of packet payload - that 1500 bytes includes whatever protocol level headers are on there. For TCP/IP you're typically looking at spending around 40 bytes on the headers, leaving somewhere around 1460 bytes of usable payload. And if you're using a VPN, things get annoying. In this case the original packet becomes the payload of a new packet, which means it needs another set of TCP (or UDP) and IP headers, and probably also some VPN header. This still all needs to fit inside the MTU of the link the VPN packet is being sent over, so if the MTU of that is 1500, the effective MTU of the VPN interface has to be lower. For Wireguard, this works out to an effective MTU of 1420 bytes. That means simply sending a 1500 byte packet over a Wireguard (or any other VPN) link won't work - adding the additional headers gives you a total packet size of over 1500 bytes, and that won't fit into the underlying link's MTU of 1500.
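
To make that arithmetic concrete, here is a small sketch; the overhead figures are the commonly quoted WireGuard numbers rather than anything derived from the traffic described here:

# Rough effective-MTU arithmetic for a WireGuard tunnel over a 1500-byte link.
# Assumed overheads: outer IPv6 header 40 bytes (IPv4 would be 20), UDP 8 bytes,
# WireGuard data-message header 16 bytes, Poly1305 auth tag 16 bytes.
LINK_MTU = 1500
OUTER_IP = 40          # sized for IPv6 so the same tunnel MTU works either way
OUTER_UDP = 8
WIREGUARD_HEADER = 16
WIREGUARD_AUTH_TAG = 16

effective_mtu = LINK_MTU - OUTER_IP - OUTER_UDP - WIREGUARD_HEADER - WIREGUARD_AUTH_TAG
print(effective_mtu)   # 1420, matching WireGuard's default interface MTU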

And yet, things work. But how? Faced with a packet that's too big to fit into a link, there are two choices - break the packet up into multiple smaller packets ("fragmentation") or tell whoever's sending the packet to send smaller packets. Fragmentation seems like the obvious answer, so I'd encourage you to read Valerie Aurora's article on how fragmentation is more complicated than you think. tl;dr - if you can avoid fragmentation then you're going to have a better life. You can explicitly indicate that you don't want your packets to be fragmented by setting the Don't Fragment bit in your IP header, and then when your packet hits a link where your packet exceeds the link MTU it'll send back a packet telling the remote that it's too big, what the actual MTU is, and the remote will resend a smaller packet. This avoids all the hassle of handling fragments in exchange for the cost of a retransmit the first time the MTU is exceeded. It also typically works these days, which wasn't always the case - people had a nasty habit of dropping the ICMP packets telling the remote that the packet was too big, which broke everything.

What I saw when I tcpdumped on the remote VPN endpoint's external interface was that the connection was getting established, and then a 1500 byte packet would arrive (this is kind of the behaviour you'd expect for video - the connection handshaking involves a bunch of relatively small packets, and then once you start sending the video stream itself you start sending packets that are as large as possible in order to minimise overhead). This 1500 byte packet wouldn't fit down the Wireguard link, so the endpoint sent back an ICMP packet to the remote telling it to send smaller packets. The remote should then have sent a new, smaller packet - instead, about a second after sending the first 1500 byte packet, it sent that same 1500 byte packet. This is consistent with it ignoring the ICMP notification and just behaving as if the packet had been dropped.

All the services that were failing were failing in identical ways, and all were using Fastly as their CDN. I complained about this on social media and then somehow ended up in contact with the engineering team responsible for this sort of thing - I sent them a packet dump of the failure, they were able to reproduce it, and it got fixed. Hurray!

(Between me identifying the problem and it getting fixed I was able to work around it. The TCP header includes a Maximum Segment Size (MSS) field, which indicates the maximum size of the payload for this connection. iptables allows you to rewrite this, so on the VPN endpoint I simply rewrote the MSS to be small enough that the packets would fit inside the Wireguard MTU. This isn't a complete fix since it's done at the TCP level rather than the IP level - so any large UDP packets would still end up breaking)
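
For reference, the clamp value follows from the same arithmetic as above; a quick sketch (the iptables rule in the comment shows the usual general-purpose shape of such a clamp, not necessarily the exact rule used here):

# Working out an MSS that keeps TCP segments inside the WireGuard tunnel MTU.
# MSS is payload only, so subtract the (minimum) IPv4 and TCP header sizes.
TUNNEL_MTU = 1420
IPV4_HEADER = 20
TCP_HEADER = 20

mss = TUNNEL_MTU - IPV4_HEADER - TCP_HEADER
print(mss)  # 1380

# The usual iptables incantation for this kind of clamp looks like:
#   iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
#            -j TCPMSS --set-mss 1380
# (or -j TCPMSS --clamp-mss-to-pmtu to derive the value automatically)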

I've no idea what the underlying issue was, and at the client end the failure was entirely opaque: the remote simply stopped sending me packets. The only reason I was able to debug this at all was because I controlled the other end of the VPN as well, and even then I wouldn't have been able to do anything about it other than being in the fortuitous situation of someone able to do something about it seeing my post. How many people go through their lives dealing with things just being broken and having no idea why, and how do we fix that?

(Edit: thanks to this comment, it sounds like the underlying issue was a kernel bug that Fastly developed a fix for - under certain configurations, the kernel fails to associate the MTU update with the egress interface and so it continues sending overly large packets)


13 February 2024

Matthew Palmer: Not all TLDs are Created Equal

In light of the recent cancellation of the queer.af domain registration by the Taliban, the fragile and difficult nature of country-code top-level domains (ccTLDs) has once again been comprehensively demonstrated. Since many people may not be aware of the risks, I thought I'd give a solid explainer of the whole situation, and explain why you should, in general, not have anything to do with domains which are registered under ccTLDs.

Top-level What-Now?
A top-level domain (TLD) is the last part of a domain name (the collection of words, separated by periods, after the https:// in your web browser's location bar). It's the "com" in example.com, or the "af" in queer.af. There are two kinds of TLDs: country-code TLDs (ccTLDs) and generic TLDs (gTLDs). Despite all being TLDs, they're very different beasts under the hood.

What's the Difference?
Generic TLDs are what most organisations and individuals register their domains under: old-school technobabble like "com", "net", or "org", historical oddities like "gov", and the new-fangled world of words like "tech", "social", and "bank". These gTLDs are all regulated under a set of rules created and administered by ICANN (the "Internet Corporation for Assigned Names and Numbers"), which try to ensure that things aren't a complete wild-west, limiting things like price hikes (well, sometimes, anyway), and providing means for disputes over names[1].

Country-code TLDs, in contrast, are all two letters long[2], and are given out to countries to do with as they please. While ICANN kinda-sorta has something to do with ccTLDs (in the sense that it makes them exist on the Internet), it has no authority to control how a ccTLD is managed. If a country decides to raise prices by 100x, or cancel all registrations that were made on the 12th of the month, there's nothing anyone can do about it.

If that sounds bad, that's because it is. Also, it's not a theoretical problem: the Taliban deciding to assert its bigotry over the little corner of the Internet namespace it has taken control of is far from the first time that ccTLDs have caused grief.

Shifting Sands
The queer.af cancellation is interesting because, at the time the domain was reportedly registered, 2018, Afghanistan had what one might describe as, at least, a different political climate. Since then, of course, things have changed, and the new bosses have decided to get a bit more active. Those running queer.af seem to have seen the writing on the wall, and were planning on moving to another, less fraught, domain, but hadn't completed that move when the Taliban came knocking.

The Curious Case of Brexit
When the United Kingdom decided to leave the European Union, it fell foul of the EU's rules for the registration of domains under the "eu" ccTLD[3]. To register (and maintain) a domain name ending in .eu, you have to be a resident of the EU. When the UK ceased to be part of the EU, residents of the UK were no longer EU residents. Cue much unhappiness, wailing, and gnashing of teeth when this was pointed out to Britons. Some decided to give up their domains, and move to other parts of the Internet, while others managed to hold onto them by various legal sleight-of-hand (like having an EU company maintain the registration on their behalf). In any event, all very unpleasant for everyone involved.

Geopolitics on the Internet?!?
After Russia invaded Ukraine in February 2022, the Ukrainian Vice Prime Minister asked ICANN to suspend ccTLDs associated with Russia. While ICANN said that it wasn't going to do that, because it wouldn't do anything useful, some domain registrars (the companies you pay to register domain names) ceased to deal in Russian ccTLDs, and some websites restricted links to domains with Russian ccTLDs. Whether or not you agree with the sort of activism implied by these actions, the fact remains that even the actions of a government that aren't directly related to the Internet can have grave consequences for your domain name, if it's registered under a ccTLD. I don't think any gTLD operator will be invading a neighbouring country any time soon.

Money, Money, Money, Must Be Funny
When you register a domain name, you pay a registration fee to a registrar, who does administrative gubbins and causes you to be able to control the domain name in the DNS. However, you don't own that domain name[4]; you're only renting it. When the registration period comes to an end, you have to renew the domain name, or you'll cease to be able to control it. Given that a domain name is typically your brand or identity online, the chances are you'd prefer to keep it over time, because moving to a new domain name is a massive pain: having to tell all your customers or users that now you're somewhere else, plus having to accept the risk of someone registering the domain name you used to have and capturing your traffic. It's all a gigantic hassle.

For gTLDs, ICANN has various rules around price increases and bait-and-switch pricing that try to keep a lid on the worst excesses of registries. While there are any number of reasonable criticisms of the rules, and the Internet community has to stay on their toes to keep ICANN from totally succumbing to regulatory capture, at least in the gTLD space there's some degree of control over price gouging. On the other hand, ccTLDs have no effective controls over their pricing. For example, in 2008 the Seychelles increased the price of .sc domain names from US$25 to US$75. No reason, no warning, just "pay up".

Who Is Even Getting That Money?
A closely related concern about ccTLDs is that some of the cool ones are assigned to countries that are not great. The poster child for this is almost certainly Libya, which has the ccTLD "ly". While Libya was being run by a terrorist-supporting extremist, companies thought it was a great idea to have domain names that ended in .ly. These domain registrations weren't (and aren't) cheap, and it's hard to imagine that at least some of that money wasn't going to benefit the Gaddafi regime. Similarly, the British Indian Ocean Territory, which has the "io" ccTLD, was created in a colonialist piece of chicanery that expelled thousands of native Chagossians from Diego Garcia. Money from the registration of .io domains doesn't go to the (former) residents of the Chagos islands; instead it gets paid to the UK government. Again, I'm not trying to suggest that all gTLD operators are wonderful people, but it's not particularly likely that the direct beneficiaries of the operation of a gTLD stole an island chain and evicted the residents.

Are ccTLDs Ever Useful?
The answer to that question is an unqualified "maybe". I certainly don't think it's a good idea to register a domain under a ccTLD for vanity purposes: because it makes a word, is the same as a file extension you like, or because it looks cool. Those ccTLDs that clearly represent and are associated with a particular country are more likely to be OK, because there is less impetus for the registry to try a naked cash grab. Unfortunately, ccTLD registries have a disconcerting habit of changing their minds on whether they serve their geographic locality, such as when auDA decided to declare an open season in the .au namespace some years ago. Essentially, while a ccTLD may have geographic connotations now, there's not a lot of guarantee that they won't fall victim to scope creep in the future.

Finally, it might be somewhat safer to register under a ccTLD if you live in the location involved. At least then you might have a better idea of whether your domain is likely to get pulled out from underneath you. Unfortunately, as the .eu example shows, living somewhere today is no guarantee you'll still be living there tomorrow, even if you don't move house. In short, I'd suggest sticking to gTLDs. They're at least lower risk than ccTLDs.

+1, Helpful
If you've found this post informative, why not buy me a refreshing beverage? My typing fingers (both of them) thank you in advance for your generosity.

Footnotes
  1. don't make the mistake of thinking that I approve of ICANN or how it operates; it's an omnishambles of poor governance and incomprehensible decision-making.
  2. corresponding roughly, though not precisely (because everything has to be complicated, because humans are complicated), to the entries in the ISO standard for "Codes for the representation of names of countries and their subdivisions", ISO 3166.
  3. yes, the EU is not a country; it's part of the "roughly, though not precisely" caveat mentioned previously.
  4. despite what domain registrars try very hard to imply, without falling foul of deceptive advertising regulations.

30 January 2024

Matthew Palmer: Why Certificate Lifecycle Automation Matters

If you've perused the ActivityPub feed of certificates whose keys are known to be compromised, and clicked on the "Show More" button to see the name of the certificate issuer, you may have noticed that some issuers seem to come up again and again. This might make sense; after all, if a CA is issuing a large volume of certificates, they'll be seen more often in a list of compromised certificates. In an attempt to see if there is anything that we can learn from this data, though, I did a bit of digging, and came up with some illuminating results.

The Procedure
I started off by finding all the unexpired certificates logged in Certificate Transparency (CT) logs that have a key that is in the pwnedkeys database as having been publicly disclosed. From this list of certificates, I removed duplicates by matching up issuer/serial number tuples, and then reduced the set by counting the number of unique certificates by their issuer. This gave me a list of the issuers of these certificates, which looks a bit like this:
/C=BE/O=GlobalSign nv-sa/CN=AlphaSSL CA - SHA256 - G4
/C=GB/ST=Greater Manchester/L=Salford/O=Sectigo Limited/CN=Sectigo RSA Domain Validation Secure Server CA
/C=GB/ST=Greater Manchester/L=Salford/O=Sectigo Limited/CN=Sectigo RSA Organization Validation Secure Server CA
/C=US/ST=Arizona/L=Scottsdale/O=GoDaddy.com, Inc./OU=http://certs.godaddy.com/repository//CN=Go Daddy Secure Certificate Authority - G2
/C=US/ST=Arizona/L=Scottsdale/O=Starfield Technologies, Inc./OU=http://certs.starfieldtech.com/repository//CN=Starfield Secure Certificate Authority - G2
/C=AT/O=ZeroSSL/CN=ZeroSSL RSA Domain Secure Site CA
/C=BE/O=GlobalSign nv-sa/CN=GlobalSign GCC R3 DV TLS CA 2020
Rather than try to work with raw issuers (because, as Andrew Ayer says, The SSL Certificate Issuer Field is a Lie), I mapped these issuers to the organisations that manage them, and summed the counts for those grouped issuers together.
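
A sketch of that dedup-and-group step, assuming the matched certificates are available as records carrying an issuer DN and serial number (the field names and the issuer-to-organisation mapping here are illustrative, not the actual pipeline):

# Deduplicate matched certificates by (issuer, serial) and count them per
# issuing organisation. The input records and the issuer-to-organisation
# mapping are placeholders for whatever your CT-matching pipeline produces.
from collections import Counter

def compromised_counts(certs, issuer_to_org):
    seen = set()
    counts = Counter()
    for cert in certs:
        key = (cert["issuer_dn"], cert["serial"])
        if key in seen:          # same cert logged in multiple CT logs
            continue
        seen.add(key)
        org = issuer_to_org.get(cert["issuer_dn"], cert["issuer_dn"])
        counts[org] += 1
    return counts.most_common()  # e.g. [("Sectigo", 170), ("ISRG (Let's Encrypt)", 161), ...]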

The Data
[Image: Lieutenant Commander Data from Star Trek: The Next Generation. Insert obligatory "not THAT data" comment here.]
The end result of this work is the following table, sorted by the count of certificates which have been compromised by exposing their private key:
Issuer                  Compromised Count
Sectigo                 170
ISRG (Let's Encrypt)    161
GoDaddy                 141
DigiCert                81
GlobalSign              46
Entrust                 3
SSL.com                 1
If you're familiar with the CA ecosystem, you'll probably recognise that the organisations with large numbers of compromised certificates are also those who issue a lot of certificates. So far, nothing particularly surprising, then. Let's look more closely at the relationships, though, to see if we can get more useful insights.

Volume Control
Using the issuance volume report from crt.sh, we can compare issuance volumes to compromise counts, to come up with a "compromise rate". I'm using the "Unexpired Precertificates" column from the issuance volume report, as I feel that's the number that best matches the certificate population I'm examining to find compromised certificates. To maintain parity with the previous table, this one is still sorted by the count of certificates that have been compromised.
Issuer                  Issuance Volume   Compromised Count   Compromise Rate
Sectigo                 88,323,068        170                 1 in 519,547
ISRG (Let's Encrypt)    315,476,402       161                 1 in 1,959,480
GoDaddy                 56,121,429        141                 1 in 398,024
DigiCert                144,713,475       81                  1 in 1,786,586
GlobalSign              1,438,485         46                  1 in 31,271
Entrust                 23,166            3                   1 in 7,722
SSL.com                 171,816           1                   1 in 171,816
If we now sort this table by compromise rate, we can see which organisations have the most (and least) leakiness going on from their customers:
Issuer                  Issuance Volume   Compromised Count   Compromise Rate
Entrust                 23,166            3                   1 in 7,722
GlobalSign              1,438,485         46                  1 in 31,271
SSL.com                 171,816           1                   1 in 171,816
GoDaddy                 56,121,429        141                 1 in 398,024
Sectigo                 88,323,068        170                 1 in 519,547
DigiCert                144,713,475       81                  1 in 1,786,586
ISRG (Let's Encrypt)    315,476,402       161                 1 in 1,959,480
By grouping by order-of-magnitude in the compromise rate, we can identify three "bands":
  • The Super Leakers: Customers of Entrust and GlobalSign seem to love to lose control of their private keys. For Entrust, at least, though, the small volumes involved make the numbers somewhat untrustworthy. The three compromised certificates could very well belong to just one customer, for instance. I'm not aware of anything that GlobalSign does that would make them such an outlier, either, so I'm inclined to think they just got unlucky with one or two customers, but as CAs don't include customer IDs in the certificates they issue, it's not possible to say whether that's the actual cause or not.
  • The Regular Leakers: Customers of SSL.com, GoDaddy, and Sectigo all have compromise rates in the 1-in-hundreds-of-thousands range. Again, the low volumes of SSL.com make the numbers somewhat unreliable, but the other two organisations in this group have large enough numbers that we can rely on that data fairly well, I think.
  • The Low Leakers: Customers of DigiCert and Let's Encrypt are at least three times less likely than customers of the regular leakers to lose control of their private keys. Good for them!
Now we have some useful insights we can think about.

Why Is It So?
[Image: Professor Julius Sumner Miller. If you don't know who Professor Julius Sumner Miller is, I highly recommend finding out.]
All of the organisations on the list, with the exception of Let's Encrypt, are what one might term "traditional" CAs. To a first approximation, it's reasonable to assume that the vast majority of the customers of these traditional CAs probably manage their certificates the same way they have for the past two decades or more. That is, they generate a key and CSR, upload the CSR to the CA to get a certificate, then copy the cert and key somewhere. Since humans are handling the keys, there's a higher risk of the humans using either risky practices, or making a mistake, and exposing the private key to the world.

Let's Encrypt, on the other hand, issues all of its certificates using the ACME (Automatic Certificate Management Environment) protocol, and all of the Let's Encrypt documentation encourages the use of software tools to generate keys, issue certificates, and install them for use. Given that Let's Encrypt has 161 compromised certificates currently in the wild, it's clear that the automation in use is far from perfect, but the significantly lower compromise rate suggests to me that lifecycle automation at least reduces the rate of key compromise, even though it doesn't eliminate it completely.

Explaining the Outlier
The difference in presumed issuance practices would seem to explain the significant difference in compromise rates between Let's Encrypt and the other organisations, if it weren't for one outlier. This is a largely traditional CA, with the manual-handling issues that implies, but with a compromise rate close to that of Let's Encrypt. We are, of course, talking about DigiCert.

The thing about DigiCert, that doesn't show up in the raw numbers from crt.sh, is that DigiCert manages the issuance of certificates for several of the biggest hosted TLS providers, such as CloudFlare and AWS. When these services obtain a certificate from DigiCert on their customer's behalf, the private key is kept locked away, and no human can (we hope) get access to the private key. This is supported by the fact that no certificates identifiably issued to either CloudFlare or AWS appear in the set of certificates with compromised keys. When we ask for "all certificates issued by DigiCert", we get both the certificates issued to these big providers, which are very good at keeping their keys under control, as well as the certificates issued to everyone else, whose key handling practices may not be quite so stringent.

It's possible, though not trivial, to account for certificates issued to these hosted TLS providers, because the certificates they use are issued from intermediates branded to those companies. With the crt.sh psql interface we can run this query to get the total number of unexpired precertificates issued to these managed services:
SELECT SUM(sub.NUM_ISSUED[2] - sub.NUM_EXPIRED[2])
  FROM (
    SELECT ca.name, max(coalesce(coalesce(nullif(trim(cc.SUBORDINATE_CA_OWNER), ''), nullif(trim(cc.CA_OWNER), '')), cc.INCLUDED_CERTIFICATE_OWNER)) as OWNER,
           ca.NUM_ISSUED, ca.NUM_EXPIRED
      FROM ccadb_certificate cc, ca_certificate cac, ca
     WHERE cc.CERTIFICATE_ID = cac.CERTIFICATE_ID
       AND cac.CA_ID = ca.ID
  GROUP BY ca.ID
  ) sub
 WHERE sub.name ILIKE '%Amazon%' OR sub.name ILIKE '%CloudFlare%' AND sub.owner = 'DigiCert';
The number I get from running that query is 104,316,112, which should be subtracted from DigiCert's total issuance figures to get a more accurate view of what DigiCert's regular customers do with their private keys. When I do this, the compromise rates table, sorted by the compromise rate, looks like this:
Issuer                  Issuance Volume   Compromised Count   Compromise Rate
Entrust                 23,166            3                   1 in 7,722
GlobalSign              1,438,485         46                  1 in 31,271
SSL.com                 171,816           1                   1 in 171,816
GoDaddy                 56,121,429        141                 1 in 398,024
"Regular" DigiCert      40,397,363        81                  1 in 498,732
Sectigo                 88,323,068        170                 1 in 519,547
All DigiCert            144,713,475       81                  1 in 1,786,586
ISRG (Let's Encrypt)    315,476,402       161                 1 in 1,959,480
In short, it appears that DigiCert's regular customers are just as likely as GoDaddy or Sectigo customers to expose their private keys.

What Does It All Mean?
The takeaway from all this is fairly straightforward, and not overly surprising, I believe.

The less humans have to do with certificate issuance, the less likely they are to compromise that certificate by exposing the private key. While it may not be surprising, it is nice to have some empirical evidence to back up the common wisdom.

Fully-managed TLS providers, such as CloudFlare, AWS Certificate Manager, and whatever Azure's thing is called, are the platonic ideal of this principle: never give humans any opportunity to expose a private key. I'm not saying you should use one of these providers, but the security approach they have adopted appears to be the optimal one, and should be emulated universally.

The ACME protocol is the next best, in that there are a variety of standardised tools widely available that allow humans to take themselves out of the loop, but it's still possible for humans to handle (and mistakenly expose) key material if they try hard enough. Legacy issuance methods, which either cannot be automated, or require custom, per-provider automation to be developed, appear to be at least four times less helpful to the goal of avoiding compromise of the private key associated with a certificate.

Humans Are, Of Course, The Problem
[Image: Bender, the robot from Futurama, asking if we'd like to kill all humans. No thanks, Bender, I'm busy tonight.]
This observation, that if you don't let humans near keys they don't get leaked, is further supported by considering the biggest issuers by volume who have not issued any certificates whose keys have been compromised: Google Trust Services (fourth largest issuer overall, with 57,084,529 unexpired precertificates), and Microsoft Corporation (sixth largest issuer overall, with 22,852,468 unexpired precertificates). It appears that somewhere between most and basically all of the certificates these organisations issue are to customers of their public clouds, and my understanding is that the keys for these certificates are managed in the same manner as CloudFlare and AWS: the keys are locked away where humans can't get to them. It should, of course, go without saying that if a human can never have access to a private key, it makes it rather difficult for a human to expose it.

More broadly, if you are building something that handles sensitive or secret data, the more you can do to keep humans out of the loop, the better everything will be.

Your Support is Appreciated
If you'd like to see more analysis of how key compromise happens, and the lessons we can learn from examining billions of certificates, please show your support by buying me a refreshing beverage. Trawling CT logs is thirsty work.

Appendix: Methodology Limitations
In the interests of clarity, I feel it's important to describe ways in which my research might be flawed. Here are the things I know of that may have impacted the accuracy, that I couldn't feasibly account for.
  • Time Periods: Because time never stops, there are likely to be some slight mismatches in the numbers obtained from the various data sources, because they weren't collected at exactly the same moment.
  • Issuer-to-Organisation Mapping: It's possible that the way I mapped issuers to organisations doesn't match exactly with how crt.sh does it, meaning that counts might be skewed. I tried to minimise that by using the same data sources (the CCADB AllCertificates report) that I believe crt.sh uses for its mapping, but I cannot be certain of a perfect match.
  • Unwarranted Grouping: I've drawn some conclusions about the practices of the various organisations based on their general approach to certificate issuance. If a particular subordinate CA that I've grouped into the parent organisation is managed in some unusual way, that might cause my conclusions to be erroneous. I was able to fairly easily separate out CloudFlare, AWS, and Azure, but there are almost certainly others that I didn't spot, because hoo boy there are a lot of intermediate CAs out there.

16 January 2024

Matthew Palmer: Pwned Certificates on the Fediverse

As well as the collection and distribution of compromised keys, the pwnedkeys project also matches those pwned keys against issued SSL certificates. I'm excited to announce that, as of the beginning of 2024, all matched certificates are now being published on the Fediverse, thanks to the botsin.space Mastodon server. Want to know which sites are susceptible to interception and interference, in (near-)real time? Do you have a burning desire to know who is issuing certificates to people that post their private keys in public? Now you can.

How It Works
The process for publishing pwned certs is, roughly, as follows:
  1. All the certificates in Certificate Transparency (CT) logs are hoovered up (using my scrape-ct-log tool, the fastest log scraper in the west!), and the fingerprint of the public key of each certificate is stored in an LMDB datafile.
  2. As new private keys are identified as having been compromised, the fingerprint of that key is checked against all the LMDB files, which map key fingerprints to certificates (actually to CT log entry IDs, from which the certificates themselves are retrieved).
  3. If one or more matches are found, then the certificates using the compromised key are forwarded to the "tooter", which publishes them for the world to marvel at.
This makes it sound all very straightforward, and it is, in theory. The trick comes in optimising the pipeline so that the five million or so new certificates every day can get indexed on the one slightly middle-aged server I've got, without getting backlogged.
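
A toy version of the indexing step, assuming raw SPKI SHA-256 fingerprints as keys (the real pipeline stores CT log entry IDs per the description above, and has to care rather more about write throughput):

# Minimal sketch of the fingerprint -> CT entry index using LMDB.
# Keys are SPKI fingerprints (raw 32-byte SHA-256), values are log entry IDs;
# this schema is illustrative, not the one pwnedkeys actually uses.
import lmdb
import struct

env = lmdb.open("ct-index.lmdb", map_size=64 * 1024**3)  # plenty of headroom

def index_entry(spki_fingerprint: bytes, log_entry_id: int) -> None:
    with env.begin(write=True) as txn:
        txn.put(spki_fingerprint, struct.pack(">Q", log_entry_id))

def lookup(spki_fingerprint: bytes):
    with env.begin() as txn:
        value = txn.get(spki_fingerprint)
    return struct.unpack(">Q", value)[0] if value else None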

Why Don't You Just Have the Certificates Revoked?
Funny story about that... I used to notify CAs of certificates they'd issued using compromised keys, which had the effect of requiring them to revoke the associated certificates. However, several CAs disliked having to revoke all those certificates, because it cost them staff time (and hence money) to do so. They went so far as to change their procedures from the standard way of accepting problem reports (emailing a generic attestation of compromise), and instead required CA-specific hoop-jumping to notify them of compromised keys. Since the effectiveness of revocation in the WebPKI is, shall we say, homeopathic at best, I decided I couldn't be bothered to play whack-a-mole with CAs that just wanted to be difficult, and I stopped sending compromised key notifications to CAs. Instead, now I'm publishing the details of compromised certificates to everyone, so that users can protect themselves directly should they choose to.

Further Work
The astute amongst you may have noticed, in the above "How It Works" description, a bit of a gap in my scanning coverage. CAs can (and do!) issue certificates for keys that are already compromised, including weak keys that have been known about for a decade or more (1, 2, 3). However, as currently implemented, the pwnedkeys certificate checker does not automatically find such certificates. My plan is to augment the CT scraping / cert processing pipeline to check all incoming certificates against the existing (2M+) set of pwned keys. Though, with over five million new certificates to check every day, it's not necessarily as simple as just "hit the pwnedkeys API for every new cert". The poor old API server might not like that very much.
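
The core of the check itself is cheap; a minimal sketch, assuming the pwned-key fingerprints are SHA-256 hashes of the DER-encoded SubjectPublicKeyInfo and are held locally as a set (the function names are illustrative):

# Check a freshly-scraped certificate against a locally held set of
# compromised-key fingerprints, rather than calling out to an API per cert.
# The fingerprint scheme (SHA-256 over the DER SPKI) and local set are assumptions.
import hashlib
from cryptography import x509
from cryptography.hazmat.primitives import serialization

def spki_sha256(cert_der: bytes) -> str:
    cert = x509.load_der_x509_certificate(cert_der)
    spki = cert.public_key().public_bytes(
        serialization.Encoding.DER,
        serialization.PublicFormat.SubjectPublicKeyInfo,
    )
    return hashlib.sha256(spki).hexdigest()

def is_pwned(cert_der: bytes, pwned_fingerprints: set[str]) -> bool:
    return spki_sha256(cert_der) in pwned_fingerprints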

Support My Work
If you'd like to see this extra matching happen a bit quicker, I've set up a ko-fi supporters page, where you can support my work on pwnedkeys and the other open source software and projects I work on by buying me a refreshing beverage. I would be very appreciative, and your support lets me know I should do more interesting things with the giant database of compromised keys I've accumulated.

2 January 2024

Matthew Garrett: Dealing with weird ELF libraries

Libraries are collections of code that are intended to be usable by multiple consumers (if you're interested in the etymology, watch this video). In the old days we had what we now refer to as "static" libraries, collections of code that existed on disk but which would be copied into newly compiled binaries. We've moved beyond that, thankfully, and now make use of what we call "dynamic" or "shared" libraries - instead of the code being copied into the binary, a reference to the library function is incorporated, and at runtime the code is mapped from the on-disk copy of the shared object[1]. This allows libraries to be upgraded without needing to modify the binaries using them, and if multiple applications are using the same library at once it only requires that one copy of the code be kept in RAM.

But for this to work, two things are necessary: when we build a binary, there has to be a way to reference the relevant library functions in the binary; and when we run a binary, the library code needs to be mapped into the process.

(I'm going to somewhat simplify the explanations from here on - things like symbol versioning make this a bit more complicated but aren't strictly relevant to what I was working on here)

For the first of these, the goal is to replace a call to a function (eg, printf()) with a reference to the actual implementation. This is the job of the linker rather than the compiler (eg, if you use the -c argument to tell gcc to simply compile to an object rather than linking an executable, it's not going to care about whether or not every function called in your code actually exists - that'll be figured out when you link all the objects together), and the linker needs to know which symbols (which aren't just functions - libraries can export variables or structures and so on) are available in which libraries. You give the linker a list of libraries, it extracts the symbols available, and resolves the references in your code with references to the library.

But how is that information extracted? Each ELF object has a fixed-size header that contains references to various things, including a reference to a list of "section headers". Each section has a name and a type, but the ones we're interested in are .dynstr and .dynsym. .dynstr contains a list of strings, representing the name of each exported symbol. .dynsym is where things get more interesting - it's a list of structs that contain information about each symbol. This includes a bunch of fairly complicated stuff that you need to care about if you're actually writing a linker, but the relevant entries for this discussion are an index into .dynstr (which means the .dynsym entry isn't sufficient to know the name of a symbol, you need to extract that from .dynstr), along with the location of that symbol within the library. The linker can parse this information and obtain a list of symbol names and addresses, and can now replace the call to printf() with a reference to libc instead.
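
If you want to poke at the same data without writing a parser yourself, pyelftools (a third-party Python library) will show it to you; a minimal sketch that lists a library's dynamic symbols, assuming the section headers are actually present:

# List the dynamic symbols of a shared library the way the build-time linker
# sees them: names live in .dynstr, the entries themselves in .dynsym.
import sys
from elftools.elf.elffile import ELFFile

with open(sys.argv[1], "rb") as f:
    elf = ELFFile(f)
    dynsym = elf.get_section_by_name(".dynsym")
    if dynsym is None:
        sys.exit("no .dynsym section header - see the rest of this post")
    for sym in dynsym.iter_symbols():
        # st_value is the symbol's address within the library
        print(f"{sym['st_value']:#018x} {sym.name}")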

(Note that it's not possible to simply encode this as "Call this address in this library" - if the library is rebuilt or is a different version, the function could move to a different location)

Experimentally, .dynstr and .dynsym appear to be sufficient for linking a dynamic library at build time - there are other sections related to dynamic linking, but you can link against a library that's missing them. Runtime is where things get more complicated.

When you run a binary that makes use of dynamic libraries, the code from those libraries needs to be mapped into the resulting process. This is the job of the runtime dynamic linker, or RTLD[2]. The RTLD needs to open every library the process requires, map the relevant code into the process's address space, and then rewrite the references in the binary into calls to the library code. This requires more information than is present in .dynstr and .dynsym - at the very least, it needs to know the list of required libraries.

There's a separate section called .dynamic that contains another list of structures, and it's the data here that's used for this purpose. For example, .dynamic contains a bunch of entries of type DT_NEEDED - this is the list of libraries that an executable requires. There's also a bunch of other stuff that's required to actually make all of this work, but the only thing I'm going to touch on is DT_HASH. Doing all this re-linking at runtime involves resolving the locations of a large number of symbols, and if the only way you can do that is by reading a list from .dynsym and then looking up every name in .dynstr that's going to take some time. The DT_HASH entry points to a hash table - the RTLD hashes the symbol name it's trying to resolve, looks it up in that hash table, and gets the symbol entry directly (it still needs to resolve that against .dynstr to make sure it hasn't hit a hash collision - if it has it needs to look up the next hash entry, but this is still generally faster than walking the entire .dynsym list to find the relevant symbol). There's also DT_GNU_HASH which fulfills the same purpose as DT_HASH but uses a more complicated algorithm that performs even better. .dynamic also contains entries pointing at .dynstr and .dynsym, which seems redundant but will become relevant shortly.
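
The same structures are easy to inspect from Python too; a sketch, again with pyelftools, that walks the dynamic segment's tags rather than the section headers (so it keeps working on objects like the ones described below, which have no section headers at all):

# Walk the PT_DYNAMIC program header and print the entries the RTLD cares about.
import sys
from elftools.elf.elffile import ELFFile
from elftools.elf.dynamic import DynamicSegment

with open(sys.argv[1], "rb") as f:
    elf = ELFFile(f)
    for segment in elf.iter_segments():
        if not isinstance(segment, DynamicSegment):
            continue
        for tag in segment.iter_tags():
            if tag.entry.d_tag == "DT_NEEDED":
                print("needs library:", tag.needed)
            elif tag.entry.d_tag in ("DT_HASH", "DT_GNU_HASH", "DT_SYMTAB", "DT_STRTAB"):
                print(tag.entry.d_tag, hex(tag.entry.d_val))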

So, .dynsym and .dynstr are required at build time, and both are required along with .dynamic at runtime. This seems simple enough, but obviously there's a twist and I'm sorry it's taken so long to get to this point.

I bought a Synology NAS for home backup purposes (my previous solution was a single external USB drive plugged into a small server, which had uncomfortable single point of failure properties). Obviously I decided to poke around at it, and I found something odd - all the libraries Synology ships were entirely lacking any ELF section headers. This meant no .dynstr, .dynsym or .dynamic sections, so how was any of this working? nm asserted that the libraries exported no symbols, and readelf agreed. If I wrote a small app that called a function in one of the libraries and built it, gcc complained that the function was undefined. But executables on the device were clearly resolving the symbols at runtime, and if I loaded them into ghidra the exported functions were visible. If I dlopen()ed them, dlsym() couldn't resolve the symbols - but if I hardcoded the offset into my code, I could call them directly.

Things finally made sense when I discovered that if I passed the --use-dynamic argument to readelf, I did get a list of exported symbols. It turns out that ELF is weirder than I realised. As well as the aforementioned section headers, ELF objects also include a set of program headers. One of the program header types is PT_DYNAMIC. This typically points to the same data that's present in the .dynamic section. Remember when I mentioned that .dynamic contained references to .dynsym and .dynstr? This means that simply pointing at .dynamic is sufficient, there's no need to have separate entries for them.

The same information can be reached from two different locations. The information in the section headers is used at build time, and the information in the program headers at run time[3]. I do not have an explanation for this. But if the information is present in two places, it seems obvious that it should be possible to reconstruct the missing section headers in my weird libraries? So that's what this does. It extracts information from the DYNAMIC entry in the program headers and creates equivalent section headers.

There's one thing that makes this more difficult than it might seem. The section header for .dynsym has to contain the number of symbols present in the section. And that information doesn't directly exist in DYNAMIC - to figure out how many symbols exist, you're expected to walk the hash tables and keep track of the largest number you've seen. Since every symbol has to be referenced in the hash table, once you've hit every entry the largest number is the number of exported symbols. This seemed annoying to implement, so instead I cheated, added code to simply pass in the number of symbols on the command line, and then just parsed the output of readelf against the original binaries to extract that information and pass it to my tool.
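
For what it's worth, if the library carries an old-style DT_HASH table the count is available without walking anything: the second word of that table (nchain) equals the number of dynamic symbols. A sketch of reading it, assuming the hash table's address falls inside a PT_LOAD segment so it can be translated to a file offset (DT_GNU_HASH has no such shortcut, hence the chain-walking):

# Recover the .dynsym entry count from DT_HASH: the SysV hash table starts with
# two 32-bit words, nbucket and nchain, and nchain equals the symbol count.
# Assumes the usual 32-bit hash words (a few ABIs, e.g. Alpha/s390x, use 64-bit).
import struct
import sys
from elftools.elf.elffile import ELFFile
from elftools.elf.dynamic import DynamicSegment

def dynsym_count(path: str) -> int:
    with open(path, "rb") as f:
        elf = ELFFile(f)
        hash_addr = None
        for seg in elf.iter_segments():
            if isinstance(seg, DynamicSegment):
                for tag in seg.iter_tags():
                    if tag.entry.d_tag == "DT_HASH":
                        hash_addr = tag.entry.d_val
        if hash_addr is None:
            raise ValueError("no DT_HASH entry (DT_GNU_HASH needs chain walking)")
        # Translate the virtual address into a file offset via the PT_LOAD segments.
        fmt = "<II" if elf.little_endian else ">II"
        for seg in elf.iter_segments():
            start, size = seg["p_vaddr"], seg["p_filesz"]
            if seg["p_type"] == "PT_LOAD" and start <= hash_addr < start + size:
                f.seek(hash_addr - start + seg["p_offset"])
                nbucket, nchain = struct.unpack(fmt, f.read(8))
                return nchain
        raise ValueError("DT_HASH address not covered by any PT_LOAD segment")

if __name__ == "__main__":
    print(dynsym_count(sys.argv[1]))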

Somehow, this worked. I now have a bunch of library files that I can link into my own binaries to make it easier to figure out how various things on the Synology work. Now, could someone explain (a) why this information is present in two locations, and (b) why the build-time linker and run-time linker disagree on the canonical source of truth?

[1] "Shared object" is the source of the .so filename extension used in various Unix-style operating systems
[2] You'll note that "RTLD" is not an acronym for "runtime dynamic linker", because reasons
[3] For environments using the GNU RTLD, at least - I have no idea whether this is the case in all ELF environments


20 December 2023

Russell Coker: Abuse and Free Software

People in positions of power can get away with mistreating other people. For any organisation to operate effectively there have to be mechanisms to address bad behaviour, both to help the organisation to achieve its goals and to protect people who work for it. When an organisation operates in the public interest there is a greater reason to try to prevent bad behaviour, as hurting people is not in the public interest. There are many forms of power; in the free software community, a reputation for doing good technical work or work related to supporting software development gives some power and influence. We have seen examples of technical contributions used to excuse mistreatment of other people. The latest example of using a professional reputation to cover for abuse is Eben Moglen, who has done some good legal work in the past while also treating members of the community badly (as documented by Matthew Garrett) [1]. Matthew has also documented how since 2016 Eben has not been doing good work for the free software community [2]. When news comes out about people who did good work while abusing other people, they are usually defended with claims such as "we can't lose the great contributions of this one person, so it's worth losing the contributions of everyone who can't work with them", but in such situations it's very common to discover that they haven't been doing great work. This might be partly due to abusive people being better at self-promotion than at actually doing good work, and partly due to the fact that people who are afraid to speak out while someone is doing good work might suddenly feel ready to go public once that work (their defence) is declining. Bradley Kuhn's article about this situation is worth reading [3]. I don't have as much knowledge of the people involved in these disputes as Matthew, but I know enough about what is happening to be confident that Matthew's summary is accurate.

19 December 2023

Matthew Garrett: Making SSH host certificates more usable

Earlier this year, after Github accidentally committed their private RSA SSH host key to a public repository, I wrote about how better support for SSH host certificates would allow this sort of situation to be handled in a user-transparent way without any negative impact on security. I was hoping that someone would read this and be inspired to fix the problem but sadly that didn't happen so I've actually written some code myself.

The core part of this is straightforward - if a server presents you with a certificate associated with a host key, then anchor your trust in that host to whoever signed the certificate rather than to the host key itself. This means that if someone needs to replace the host key for any reason (such as, for example, them having published the private half), you can replace the host key with a new key and a new certificate, and as long as the new certificate is signed by the same key that the previous certificate was, you'll trust the new key and key rotation can be carried out without any user errors. Hurrah!
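
For reference, stock OpenSSH can already express this kind of trust anchor in known_hosts with a @cert-authority line (the hostname pattern, key material and comment below are placeholders):

@cert-authority *.example.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA... host-ca

The change described here effectively creates that kind of anchor automatically the first time a host presents a certificate, instead of pinning the individual host key.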

So obviously I wrote that bit and then thought about the failure modes and it turns out there's an obvious one - if an attacker obtained both the private key and the certificate, what stops them from continuing to use it? The certificate isn't a secret, so we basically have to assume that anyone who possesses the private key has access to it. We may have silently transitioned to a new host key on the legitimate servers, but a hostile actor able to MITM a user can keep on presenting the old key and the old certificate until it expires.

There's two ways to deal with this - either have short-lived certificates (ie, issue a new certificate every 24 hours or so even if you haven't changed the key, and specify that the certificate is invalid after those 24 hours), or have a mechanism to revoke the certificates. The former is viable if you have a very well-engineered certificate issuing operation, but still leaves a window for an attacker to make use of the certificate before it expires. The latter is something SSH has support for, but the spec doesn't define any mechanism for distributing revocation data.

So, I've implemented a new SSH protocol extension that allows a host to send a key revocation list to a client. The idea is that the client authenticates to the server, receives a key revocation list, and will no longer trust any certificates that are contained within that list. This seems simple enough, but a naive implementation opens the client to various DoS attacks. For instance, if you simply revoke any key contained within the received KRL, a hostile server could revoke any certificates that were otherwise trusted by the client. The easy way around this is for the client to ensure that any revoked keys are associated with the same CA that signed the host certificate - that way a compromised host can only revoke certificates associated with that CA, and can't interfere with anyone else.

Unfortunately that still means that a single compromised host can trigger revocation of certificates inside that trust domain (ie, a compromised host a.test.com could push a KRL that invalidated the certificate for b.test.com), because there's no way in the KRL format to indicate that a given revocation is associated with a specific hostname. This means we need a mechanism to verify that the KRL update is legitimate, and the easiest way to handle that is to sign it. The KRL format specifies an in-band signature but this was deprecated earlier this year - instead KRLs are supposed to be signed with the sshsig format. But we control both the server and the client, which means it's easy enough to send a detached signature as part of the extension data.
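
As a toy model of the client-side rule (this is not the real KRL or sshsig wire format, and all the names here are made up), the logic being described is roughly:

from dataclasses import dataclass, field

@dataclass
class HostCert:
    cert_id: str
    ca_fingerprint: str       # fingerprint of the CA key that signed this cert

@dataclass
class KrlUpdate:
    revoked_cert_ids: list
    signature_valid: bool     # result of verifying the detached signature
    signer_fingerprint: str   # CA key the signature verified against

@dataclass
class ClientTrustStore:
    revoked: set = field(default_factory=set)

    def apply_krl(self, update: KrlUpdate, presented: HostCert) -> None:
        # Only accept revocations signed by the CA we already trust for this
        # host; a compromised host can then only revoke certificates in its
        # own trust domain and can't DoS anyone else's.
        if not update.signature_valid:
            return
        if update.signer_fingerprint != presented.ca_fingerprint:
            return
        self.revoked.update(update.revoked_cert_ids)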

Putting this all together: you ssh to a server you've never contacted before, and it presents you with a host certificate. Instead of the host key being added to known_hosts, the CA key associated with the certificate is added. From now on, if you ssh to that host and it presents a certificate signed by that CA, it'll be trusted. Optionally, the host can also send you a KRL and a signature. If the signature is generated by the CA key that you already trust, any certificates in that KRL associated with that CA key will be incorporated into local storage. The expected flow if a key is compromised is that the owner of the host generates a new keypair, obtains a new certificate for the new key, and adds the old certificate to a KRL that is signed with the CA key. The next time the user connects to that host, they receive the new key and new certificate, trust it because it's signed by the same CA key, and also receive a KRL signed with the same CA that revokes trust in the old certificate.

Obviously this breaks down if a user is MITMed with a compromised key and certificate immediately after the host is compromised - they'll see a legitimate certificate and won't receive any revocation list, so will trust the host. But this is the same failure mode that would occur in the absence of certificates, where the attacker simply presents the compromised key to the client before trust in the new key has been created. This seems no worse than the status quo, but means that most users will seamlessly transition to a new key and revoke trust in the old key with no effort on their part.

The work in progress tree for this is here - at the point of writing I've merely implemented this and made sure it builds, not verified that it actually works or anything. Cleanup should happen over the next few days, and I'll propose this to upstream if it doesn't look like there's any showstopper design issues.


5 December 2023

Matthew Garrett: Why does Gnome fingerprint unlock not unlock the keyring?

There's a decent number of laptops with fingerprint readers that are supported by Linux, and Gnome has some nice integration to make use of that for authentication purposes. But if you log in with a fingerprint, the moment you start any app that wants to access stored passwords you'll get a prompt asking you to type in your password, which feels like it somewhat defeats the point. Mac users don't have this problem - authenticate with TouchID and all your passwords are available after login. Why the difference?

Fingerprint detection can be done in two primary ways. The first is that a fingerprint reader is effectively just a scanner - it passes a graphical representation of the fingerprint back to the OS and the OS decides whether or not it matches an enrolled finger. The second is for the fingerprint reader to make that determination itself, either storing a set of trusted fingerprints in its own storage or supporting being passed a set of encrypted images to compare against. Fprint supports both of these, but note that in both cases all that we get at the end of the day is a statement of "The fingerprint matched" or "The fingerprint didn't match" - we can't associate anything else with that.

Apple's solution involves wiring the fingerprint reader to a secure enclave, an independently running security chip that can store encrypted secrets or keys and only release them under pre-defined circumstances. Rather than the fingerprint reader providing information directly to the OS, it provides it to the secure enclave. If the fingerprint matches, the secure enclave can then provide some otherwise secret material to the OS. Critically, if the fingerprint doesn't match, the enclave will never release this material.

And that's the difference. When you perform TouchID authentication, the secure enclave can decide to release a secret that can be used to decrypt your keyring. We can't easily do this under Linux because we don't have an interface to store those secrets. The secret material can't just be stored on disk - that would allow anyone who had access to the disk to use that material to decrypt the keyring and get access to the passwords, defeating the object. We can't use the TPM because there's no secure communications channel between the fingerprint reader and the TPM, so we can't configure the TPM to release secrets only if an associated fingerprint is provided.

So the simple answer is that fingerprint unlock doesn't unlock the keyring because there's currently no secure way to do that. It's not intransigence on the part of the developers or a conspiracy to make life more annoying. It'd be great to fix it, but I don't see an easy way to do so at the moment.


7 November 2023

Matthew Palmer: PostgreSQL Encryption: The Available Options

On an episode of Postgres FM, the hosts had a (very brief) discussion of data encryption in PostgreSQL. While Postgres FM is a podcast well worth a subscribe, the hosts aren't data security experts, and so as someone who builds a queryable database encryption system, I found the coverage to be somewhat lacking. I figured I'd provide a more complete survey of the available options for PostgreSQL-related data encryption.

The Status Quo By default, when you install PostgreSQL, there is no data encryption at all. That means that anyone who gets access to any part of the system can read all the data they have access to. This is, of course, not peculiar to PostgreSQL: basically everything works much the same way. What's stopping an attacker from nicking off with all your data is the fact that they can't access the database at all. The things that are acting as protection are perimeter defences, like putting the physical equipment running the server in a secure datacenter, firewalls to prevent internet randos connecting to the database, and strong passwords. This is referred to as "tortoise" security: it's tough on the outside, but soft on the inside. Once that outer shell is cracked, the delicious, delicious data is ripe for the picking, and there's absolutely nothing to stop a miscreant from going to town and making off with everything. It's a good idea to plan your defences on the assumption you're going to get breached sooner or later. Having good defence-in-depth includes denying the attacker access to your data even if they compromise the database. This is where encryption comes in.

Storage-Layer Defences: Disk / Volume Encryption To protect against the compromise of the storage that your database uses (physical disks, EBS volumes, and the like), it's common to employ encryption-at-rest, such as full-disk encryption, or volume encryption. These mechanisms protect against offline attacks, but provide no protection while the system is actually running. And therein lies the rub: your database is always running, so encryption at rest typically doesn't provide much value. If you're running physical systems, disk encryption is essential, but more to prevent accidental data loss, due to things like failing to wipe drives before disposing of them, rather than physical theft. In systems where volume encryption is only a tickbox away, it's also worth enabling, if only to prevent inane questions from your security auditors. Relying solely on storage-layer defences, though, is very unlikely to provide any appreciable value in preventing data loss.

Database-Layer Defences: Transparent Database Encryption If you've used proprietary database systems in high-security environments, you might have come across Transparent Database Encryption (TDE). There are also a couple of proprietary extensions for PostgreSQL that provide this functionality. TDE is essentially encryption-at-rest implemented inside the database server. As such, it has much the same drawbacks as disk encryption: few real-world attacks are thwarted by it. There is a very small amount of additional protection, in that physical-level backups (as produced by pg_basebackup) are protected, but the vast majority of attacks aren't stopped by TDE. Any attacker who can access the database while it's running can just ask for an SQL-level dump of the stored data, and they'll get the unencrypted data quick as you like.

Application-Layer Defences: Field Encryption If you want to take the database out of the threat landscape, you really need to encrypt sensitive data before it even gets near the database. This is the realm of field encryption, more commonly known as application-level encryption. This technique involves encrypting each field of data before it is sent to be stored in the database, and then decrypting it again after it's retrieved from the database. Anyone who gets the data from the database directly, whether via a backup or a direct connection, is out of luck: they can't decrypt the data, and therefore it's worthless. There are, of course, some limitations of this technique. For starters, every ORM and data mapper out there has rolled their own encryption format, meaning that there's basically zero interoperability. This isn't a problem if you build everything that accesses the database using a single framework, but if you ever feel the need to migrate, or use the database from multiple codebases, you're likely in for a rough time. The other big problem of traditional application-level encryption is that, when the database can't understand what data it's storing, it can't run queries against that data. So if you want to encrypt, say, your users' dates of birth, but you also need to be able to query on that field, you need to choose between one or the other: you can't have both at the same time. You may think to yourself, "but this isn't any good, an attacker that breaks into my application can still steal all my data!". That is true, but security is never binary. The name of the game is reducing the attack surface, making it harder for an attacker to succeed. If you leave all the data unencrypted in the database, an attacker can steal all your data by breaking into the database or by breaking into the application. Encrypting the data reduces the attacker's options, and allows you to focus your resources on hardening the application against attack, safe in the knowledge that an attacker who gets into the database directly isn't going to get anything valuable.
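
As a minimal illustration of the pattern (using Python's cryptography library as a stand-in for whatever your framework provides; key management is hand-waved here, and the values are made up):

# pip install cryptography
from cryptography.fernet import Fernet

key = Fernet.generate_key()           # in reality this comes from a KMS/secrets store
f = Fernet(key)

date_of_birth = b"1985-03-14"
ciphertext = f.encrypt(date_of_birth)     # this opaque token is what gets stored

# ... later, after reading the column back out of the database ...
assert f.decrypt(ciphertext) == date_of_birth

# The database only ever sees `ciphertext`, which is why it can no longer
# index, compare or query the field - the trade-off described above.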

Sidenote: The Curious Case of pgcrypto PostgreSQL ships a contrib module called pgcrypto, which provides encryption and decryption functions. This sounds ideal to use for encrypting data within our applications, as it's available no matter what we're using to write our application. It avoids the problem of framework-specific cryptography, because you call the same PostgreSQL functions no matter what language you're using, which produces the same output. However, I don't recommend ever using pgcrypto's data encryption functions, and I doubt you will find many other cryptographic engineers who will, either. First up, and most horrifyingly, it requires you to pass the long-term keys to the database server. If there's an attacker actively in the database server, they can capture the keys as they come in, which means all the data encrypted using that key is exposed. Sending the keys can also result in the keys ending up in query logs, both on the client and server, which is obviously a terrible result. Less scary, but still very concerning, is that pgcrypto's available cryptography is, to put it mildly, antiquated. We have a lot of newer, safer, and faster techniques for data encryption that aren't available in pgcrypto. This means that if you do use it, you're leaving a lot on the table, and need to have skilled cryptographic engineers on hand to avoid the potential pitfalls. In short: friends don't let friends use pgcrypto.
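
To see why the key-handling point matters, consider what the server observes when pgcrypto does the encryption. pgp_sym_encrypt is a real pgcrypto function, but the values below are made up, and building SQL by string interpolation like this is itself part of the problem being illustrated:

dob, key = "1985-03-14", "super-secret-long-term-key"
query = f"SELECT pgp_sym_encrypt('{dob}', '{key}')"
print(query)
# The long-term key is now part of the statement text, so it can land in
# server query logs, client logs, pg_stat_activity, and anywhere else SQL
# statements get captured - and the server held the key in memory regardless.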

The Future: Enquo All this brings us to the project I run: Enquo. It takes application-layer encryption to a new level, by providing a language- and framework-agnostic cryptosystem that also enables encrypted data to be efficiently queried by the database. So, you can encrypt your users' dates of birth, in such a way that anyone with the appropriate keys can query the database to return, say, all users over the age of 18, but an attacker just sees unintelligible gibberish. This should greatly increase the amount of data that can be encrypted, and as the Enquo project expands its available data types and supported languages, the coverage of encrypted data will grow and grow. My eventual goal is to encrypt all data, all the time. If this appeals to you, visit enquo.org to use or contribute to the open source project, or EnquoDB.com for commercial support and hosted database options.

1 November 2023

Matthew Garrett: Why ACPI?

"Why does ACPI exist" - - the greatest thread in the history of forums, locked by a moderator after 12,239 pages of heated debate, wait no let me start again.

Why does ACPI exist? In the beforetimes power management on x86 was done by jumping to an opaque BIOS entry point and hoping it would do the right thing. It frequently didn't. We called this Advanced Power Management (Advanced because before this power management involved custom drivers for every machine and everyone agreed that this was a bad idea), and it involved the firmware having to save and restore the state of every piece of hardware in the system. This meant that assumptions about hardware configuration were baked into the firmware - failed to program your graphics card exactly the way the BIOS expected? Hurrah! It's only saved and restored a subset of the state that you configured and now potential data corruption for you. The developers of ACPI made the reasonable decision that, well, maybe since the OS was the one setting state in the first place, the OS should restore it.

So far so good. But some state is fundamentally device specific, at a level that the OS generally ignores. How should this state be managed? One way to do that would be to have the OS know about the device specific details. Unfortunately that means you can't ship the computer without having OS support for it, which means having OS support for every device (exactly what we'd got away from with APM). This, uh, was not an option the PC industry seriously considered. The alternative is that you ship something that abstracts the details of the specific hardware and makes that abstraction available to the OS. This is what ACPI does, and it's also what things like Device Tree do. Both provide static information about how the platform is configured, which can then be consumed by the OS and avoid needing device-specific drivers or configuration to be built-in.

The main distinction between Device Tree and ACPI is that Device Tree is purely a description of the hardware that exists, and so still requires the OS to know what's possible - if you add a new type of power controller, for instance, you need to add a driver for that to the OS before you can express that via Device Tree. ACPI decided to include an interpreted language to allow vendors to expose functionality to the OS without the OS needing to know about the underlying hardware. So, for instance, ACPI allows you to associate a device with a function to power down that device. That function may, when executed, trigger a bunch of register accesses to a piece of hardware otherwise not exposed to the OS, and that hardware may then cut the power rail to the device to power it down entirely. And that can be done without the OS having to know anything about the control hardware.

How is this better than just calling into the firmware to do it? Because the fact that ACPI declares that it's going to access these registers means the OS can figure out that it shouldn't, because it might otherwise collide with what the firmware is doing. With APM we had no visibility into that - if the OS tried to touch the hardware at the same time APM did, boom, almost impossible to debug failures (This is why various hardware monitoring drivers refuse to load by default on Linux - the firmware declares that it's going to touch those registers itself, so Linux decides not to in order to avoid race conditions and potential hardware damage. In many cases the firmware offers a collaborative interface to obtain the same data, and a driver can be written to get that. This bug comment discusses this for a specific board).

Unfortunately ACPI doesn't entirely remove opaque firmware from the equation - ACPI methods can still trigger System Management Mode, which is basically a fancy way to say "Your computer stops running your OS, does something else for a while, and you have no idea what". This has all the same issues that APM did, in that if the hardware isn't in exactly the state the firmware expects, bad things can happen. While historically there were a bunch of ACPI-related issues because the spec didn't define every single possible scenario and also there was no conformance suite (eg, should the interpreter be multi-threaded? Not defined by spec, but influences whether a specific implementation will work or not!), these days overall compatibility is pretty solid and the vast majority of systems work just fine - but we do still have some issues that are largely associated with System Management Mode.

One example is a recent Lenovo one, where the firmware appears to try to poke the NVME drive on resume. There's some indication that this is intended to deal with transparently unlocking self-encrypting drives on resume, but it seems to do so without taking IOMMU configuration into account and so things explode. It's kind of understandable why a vendor would implement something like this, but it's also kind of understandable that doing so without OS cooperation may end badly.

This isn't something that ACPI enabled - in the absence of ACPI firmware vendors would just be doing this unilaterally with even less OS involvement and we'd probably have even more of these issues. Ideally we'd "simply" have hardware that didn't support transitioning back to opaque code, but we don't (ARM has basically the same issue with TrustZone). In the absence of the ideal world, by and large ACPI has been a net improvement in Linux compatibility on x86 systems. It certainly didn't remove the "Everything is Windows" mentality that many vendors have, but it meant we largely only needed to ensure that Linux behaved the same way as Windows in a finite number of ways (ie, the behaviour of the ACPI interpreter) rather than in every single hardware driver, and so the chances that a new machine will work out of the box are much greater than they were in the pre-ACPI period.

There's an alternative universe where we decided to teach the kernel about every piece of hardware it should run on. Fortunately (or, well, unfortunately) we've seen that in the ARM world. Most device-specific code simply never reaches mainline, and most users are stuck running ancient kernels as a result. Imagine every x86 device vendor shipping their own kernel optimised for their hardware, and now imagine how well that works out given the quality of their firmware. Does that really seem better to you?

It's understandable why ACPI has a poor reputation. But it's also hard to figure out what would work better in the real world. We could have built something similar on top of Open Firmware instead but the distinction wouldn't be terribly meaningful - we'd just have Forth instead of the ACPI bytecode language. Longing for a non-ACPI world without presenting something that's better and actually stands a reasonable chance of adoption doesn't make the world a better place.


12 October 2023

Matthew Garrett: Defending abuse does not defend free software

The Free Software Foundation Europe and the Software Freedom Conservancy recently released a statement that they would no longer work with Eben Moglen, chairman of the Software Freedom Law Center. Eben was the general counsel for the Free Software Foundation for over 20 years, and was centrally involved in the development of version 3 of the GNU General Public License. He's devoted a great deal of his life to furthering free software.

But, as described in the joint statement, he's also acted abusively towards other members of the free software community. He's acted poorly towards his own staff. In a professional context, he's used graphically violent rhetoric to describe people he dislikes. He's screamed abuse at people attempting to do their job.

And, sadly, none of this comes as a surprise to me. As I wrote in 2017, after it became clear that Eben's opinions diverged sufficiently from the FSF's that he could no longer act as general counsel, he responded by threatening an FSF board member at an FSF-run event (various members of the board were willing to tolerate this, which is what led to me quitting the board). There's over a decade's evidence of Eben engaging in abusive behaviour towards members of the free software community, be they staff, colleagues, or just volunteers trying to make the world a better place.

When we build communities that tolerate abuse, we exclude anyone unwilling to tolerate being abused[1]. Nobody in the free software community should be expected to deal with being screamed at or threatened. Nobody should be afraid that they're about to have their sexuality outed by a former boss.

But of course there are some that will defend Eben based on his past contributions. There were people who were willing to defend Hans Reiser on that basis. We need to be clear that what these people are defending is not free software - it's the right for abusers to abuse. And in the long term, that's bad for free software.

[1] "Why don't people just get better at tolerating abuse?" is a terrible response to this. Why don't abusers stop abusing? There's fewer of them, and it should be easier.


30 September 2023

Ian Jackson: DKIM: rotate and publish your keys

If you are an email system administrator, you are probably using DKIM to sign your outgoing emails. You should be rotating the key regularly and automatically, and publishing old private keys. I have just released dkim-rotate 1.0; dkim-rotate is a tool to do this key rotation and publication. If you are an email user, your email provider ought to be doing this. If this is not done, your emails are "non-repudiable", meaning that if they are leaked, anyone (eg, journalists, haters) can verify that they are authentic, and prove that to others. This is not desirable (for you). Non-repudiation of emails is undesirable This problem was described at some length in Matthew Green's article "Ok Google: please publish your DKIM secret keys". Avoiding non-repudiation sounds a bit like lying. After all, I'm advising creating a situation where some people can't verify that something is true, even though it is. So I'm advocating casting doubt. Crucially, though, it's doubt about facts that ought to be private. When you send an email, that's between you and the recipient. Normally you don't intend for anyone, anywhere, who happens to get a copy, to be able to verify that it was really you that sent it. In practical terms, this verifiability has already been used by journalists to verify stolen emails. Associated Press provide a verification tool. Advice for all email users As a user, you probably don't want your emails to be non-repudiable. (Other people might want to be able to prove you sent some email, but your email system ought to serve your interests, not theirs.) So, your email provider ought to be rotating their DKIM keys, and publishing their old ones. At a rough guess, your provider probably isn't :-(. How to tell by looking at email headers A quick and dirty way to guess is to have a friend look at the email headers of a message you sent. (It is important that the friend uses a different email provider, since often DKIM signatures are not applied within a single email system.) If your friend sees a DKIM-Signature header then the message is DKIM signed. If they don't, then it wasn't. Most email traversing the public internet is DKIM signed nowadays; so if they don't see the header probably they're not looking using the right tools, or they're actually on the same email system as you. In messages signed by a system running dkim-rotate, there will also be a header about the key rotation, to notify potential verifiers of the situation. Other systems that avoid non-repudiation-through-DKIM might do something similar. dkim-rotate's header looks like this:
DKIM-Signature-Warning: NOTE REGARDING DKIM KEY COMPROMISE
 https://www.chiark.greenend.org.uk/dkim-rotate/README.txt
 https://www.chiark.greenend.org.uk/dkim-rotate/ae/aeb689c2066c5b3fee673355309fe1c7.pem
But an email system might do half of the job of dkim-rotate: regularly rotating the key would cause the signatures of old emails to fail to verify, which is a good start. In that case there probably won't be such a header. Testing verification of new and old messages You can also try verifying the signatures. This isn't entirely straightforward, especially if you don't have access to low-level mail tooling. Your friend will need to be able to save emails as raw whole headers and body, un-decoded, un-rendered. If your friend is using a traditional Unix mail program, they should save the message as an mbox file. Otherwise, ProPublica have instructions for attaching and transferring and obtaining the raw email. (Scroll down to "How to Check DKIM and ARC".) Checking that recent emails are verifiable Firstly, have your friend test that they can in fact verify a DKIM signature. This will demonstrate that the next test, where the verification is supposed to fail, is working properly and fails for the right reasons. Send your friend a test email now, and have them do this on a Linux system:
    # save the message as test-email.mbox
    apt install libmail-dkim-perl # or equivalent on another distro
    dkimproxy-verify <test-email.mbox
You should see output containing something like this:
    originator address: ijackson@chiark.greenend.org.uk
    signature identity: @chiark.greenend.org.uk
    verify result: pass
    ...
If the output contains verify result: fail (body has been altered) then probably your friend didn't manage to faithfully save the unaltered raw message. Checking old emails cannot be verified When you both have that working, have your friend find an older email of yours, from (say) a month ago. Perform the same steps. Hopefully they will see something like this:
    originator address: ijackson@chiark.greenend.org.uk
    signature identity: @chiark.greenend.org.uk
    verify result: fail (bad RSA signature)
or maybe
    verify result: invalid (public key: not available)
This indicates that this old email can no longer be verified. That's good: it means that anyone who steals a copy can't verify it either. If it's leaked, the journalist who receives it won't know it's genuine and unmodified; they should then be suspicious. If your friend sees verify result: pass, then they have verified that that old email of yours is genuine. Anyone who had a copy of the mail can do that. This is good for email thieves, but not for you. For email admins: announcing dkim-rotate 1.0 I have been running dkim-rotate 0.4 on my infrastructure since last August, and I had entirely forgotten about it: it has run flawlessly for a year. I was reminded of the topic by seeing DKIM in other blog posts. Obviously, it is time to decree that dkim-rotate is 1.0. If you're a mail system administrator, your users are best served if you use something like dkim-rotate. The package is available in Debian stable, and supports Exim out of the box, but other MTAs should be easy to support too, via some simple ad-hoc scripting. Limitation of this approach Even with this key rotation approach, emails remain non-repudiable for a short period after they're sent - typically, a few days. Someone who obtains a leaked email very promptly, and shows it to the journalist (for example) right away, can still convince the journalist. This is not great, but at least it doesn't apply to the vast bulk of your email archive. There are possible email protocol improvements which might help, but they're quite out of scope for this article.
Edited 2023-10-01 00:20 +01:00 to fix some grammar



13 September 2023

Matthew Garrett: Reconstructing an invalid TPM event log

TPMs contain a set of registers ("Platform Configuration Registers", or PCRs) that are used to track what a system boots. Each time a new event is measured, a cryptographic hash representing that event is passed to the TPM. The TPM appends that hash to the existing value in the PCR, hashes that, and stores the final result in the PCR. This means that while the PCR's value depends on the precise sequence and value of the hashes presented to it, the PCR value alone doesn't tell you what those individual events were. Different PCRs are used to store different event types, but there are still more events than there are PCRs so we can't avoid this problem by simply storing each event separately.

This is solved using the event log. The event log is simply a record of each event, stored in RAM. The algorithm the TPM uses to calculate the PCR values is known, so we can reproduce that by simply taking the events from the event log and replaying the series of events that were passed to the TPM. If the final calculated value is the same as the value in the PCR, we know that the event log is accurate, which means we now know the value of each individual event and can make an appropriate judgement regarding its security.
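
A sketch of that replay step for a single SHA-256 PCR (the event hashes below are made up, and as footnote [1] notes, the startup locality can change a PCR's initial value; all zeroes is used here for simplicity):

import hashlib

def extend(pcr: bytes, event_hash: bytes) -> bytes:
    """The TPM's extend operation: new PCR = H(old PCR || event hash)."""
    return hashlib.sha256(pcr + event_hash).digest()

def replay(event_hashes) -> bytes:
    pcr = b"\x00" * 32     # see footnote [1] for how locality alters PCR 0's start value
    for h in event_hashes:
        pcr = extend(pcr, h)
    return pcr

# If replay(hashes_from_the_event_log) matches what the TPM reports for that
# PCR, the log is consistent; if not, at least one entry is wrong.
events = [hashlib.sha256(b"example event %d" % i).digest() for i in range(3)]
print(replay(events).hex())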

If any value in the event log is invalid, we'll calculate a different PCR value and it won't match. This isn't terribly helpful - we know that at least one entry in the event log doesn't match what was passed to the TPM, but we don't know which entry. That means we can't trust any of the events associated with that PCR. If you're trying to make a security determination based on this, that's going to be a problem.

PCR 7 is used to track information about the secure boot policy on the system. It contains measurements of whether or not secure boot is enabled, and which keys are trusted and untrusted on the system in question. This is extremely helpful if you want to verify that a system booted with secure boot enabled before allowing it to do something security or safety critical. Unfortunately, if the device gives you an event log that doesn't replay correctly for PCR 7, you now have no idea what the security state of the system is.

We ran into that this week. Examination of the event log revealed an additional event other than the expected ones - a measurement accompanied by the string "Boot Guard Measured S-CRTM". Boot Guard is an Intel feature where the CPU verifies the firmware is signed with a trusted key before executing it, and measures information about the firmware in the process. Previously I'd only encountered this as a measurement into PCR 0, which is the PCR used to track information about the firmware itself. But it turns out that at least some versions of Boot Guard also measure information about the Boot Guard policy into PCR 7. The argument for this is that this is effectively part of the secure boot policy - having a measurement of the Boot Guard state tells you whether Boot Guard was enabled, which tells you whether or not the CPU verified a signature on your firmware before running it (as I wrote before, I think Boot Guard has user-hostile default behaviour, and that enforcing this on consumer devices is a bad idea).

But there's a problem here. The event log is created by the firmware, and the Boot Guard measurements occur before the firmware is executed. So how do we get a log that represents them? That one's fairly simple - the firmware simply re-calculates the same measurements that Boot Guard did and creates a log entry after the fact[1]. All good.

Except. What if the firmware screws up the calculation and comes up with a different answer? The entry in the event log will now not match what was sent to the TPM, and replaying will fail. And without knowing what the actual value should be, there's no way to fix this, which means there's no way to verify the contents of PCR 7 and determine whether or not secure boot was enabled.

But there's still a fundamental source of truth - the measurement that was sent to the TPM in the first place. Inspired by Henri Nurmi's work on sniffing Bitlocker encryption keys, I asked a coworker if we could sniff the TPM traffic during boot. The TPM on the board in question uses SPI, a simple bus that can have multiple devices connected to it. In this case the system flash and the TPM are on the same SPI bus, which made things easier. The board had a flash header for external reprogramming of the firmware in the event of failure, and all SPI traffic was visible through that header. Attaching a logic analyser to this header made it simple to generate a record of that. The only problem was that the chip select line on the header was attached to the firmware flash chip, not the TPM. This was worked around by simply telling the analysis software that it should invert the sense of the chip select line, ignoring all traffic that was bound for the flash and paying attention to all other traffic. This worked in this case since the only other device on the bus was the TPM, but would cause problems in the event of multiple devices on the bus all communicating.

With the aid of this analyser plugin, I was able to dump all the TPM traffic and could then search for writes that included the "0182" sequence that corresponds to the command code for a measurement event. This gave me a couple of accesses to the locality 3 registers, which was a strong indication that they were coming from the CPU rather than from the firmware. One was for PCR 0, and one was for PCR 7. This corresponded to the two Boot Guard events that we expected from the event log. The hash in the PCR 0 measurement was the same as the hash in the event log, but the hash in the PCR 7 measurement differed from the hash in the event log. Replacing the event log value with the value actually sent to the TPM resulted in the event log now replaying correctly, supporting the hypothesis that the firmware was failing to correctly reconstruct the event.

What now? The simple thing to do is for us to simply hard code this fixup, but longer term we'd like to figure out how to reconstruct the event so we can calculate the expected value ourselves. Unfortunately there doesn't seem to be any public documentation on this. Sigh.

[1] What stops firmware on a system with no Boot Guard faking those measurements? TPMs have a concept of "localities", effectively different privilege levels. When Boot Guard performs its initial measurement into PCR 0, it does so at locality 3, a locality that's only available to the CPU. This causes PCR 0 to be initialised to a different initial value, affecting the final PCR value. The firmware can't access locality 3, so can't perform an equivalent measurement, so can't fake the value.


29 August 2023

Matthew Garrett: Unix sockets, Cygwin, SSH agents, and sadness

Work involves supporting Windows (there's a lot of specialised hardware design software that's only supported under Windows, so this isn't really avoidable), but also involves git, so I've been working on extending our support for hardware-backed SSH certificates to Windows and trying to glue that into git. In theory this doesn't sound like a hard problem, but in practice oh good heavens.

Git for Windows is built on top of msys2, which in turn is built on top of Cygwin. This is an astonishing artifact that allows you to build roughly unmodified POSIXish code on top of Windows, despite the terrible impedance mismatches inherent in this. One is that until 2017, Windows had no native support for Unix sockets. That's kind of a big deal for compatibility purposes, so Cygwin worked around it. It's, uh, kind of awful. If you're not a Cygwin/msys app but you want to implement a socket they can communicate with, you need to implement this undocumented protocol yourself. This isn't impossible, but ugh.

But going to all this trouble helps you avoid another problem! The Microsoft version of OpenSSH ships an SSH agent that doesn't use Unix sockets, but uses a named pipe instead. So if you want to communicate between Cygwinish OpenSSH (as is shipped with git for Windows) and the SSH agent shipped with Windows, you need something that bridges between those. The state of the art seems to be to use npiperelay with socat, but if you're already writing something that implements the Cygwin socket protocol you can just use npipe to talk to the shipped ssh-agent and then export your own socket interface.

And, amazingly, this all works? I've managed to hack together an SSH agent (using Go's SSH agent implementation) that can satisfy hardware backed queries itself, but forward things on to the Windows agent for compatibility with other tooling. Now I just need to figure out how to plumb it through to WSL. Sigh.


25 August 2023

Ian Jackson: I cycled to all the villages in alphabetical order

This last weekend I completed a bike rides project I started during the first Covid lockdown in 2020: I've cycled to every settlement (and radio observatory) within 20km of my house, in alphabetical order. Stir crazy In early 2020, during the first lockdown, I was going a bit stir crazy. Clare said "you're going very strange, you have to go out and get some exercise". After a bit of discussion, we came up with this plan: I'd visit all the local villages, in alphabetical order. Choosing the radius I decided that I would pick a round number of kilometers, as the crow flies, from my house. 20km seemed about right. 25km would have included Ely, which would have been nice, but it would have added a great many places, all of them quite distant. Software I wrote a short Rust program to process OSM data into a list of places to visit, and their distances and bearings. You can download a tarball of the alphabetical villages scanner. (I haven't published the git history because it has my house's GPS coordinates in it, and because I committed the output files from which that location can be derived.) The Rides I set off on my first ride, to Aldreth, on Sunday the 31st of May 2020. The final ride collected Yelling, on Saturday the 19th of August 2023. I did quite a few rides in June and July 2020 - more than one a week. (I'd read the lockdown rules, and although some of the government messaging said you should stay near your house, that wasn't in the legislation. Of course I didn't go into any buildings or anything.) I'm not much of a morning person, so I often set off after lunch. For the longer rides I would usually pack a picnic. Almost all of the rides I did just by myself. There were a handful where I had friends along: Dry Drayton, which I collected with Clare, at night. I held my bike up so the light shone at the village sign, so we could take a photo of it. Madingley, Melbourn and Meldreth, which was quite an expedition with my friend Ben. We went out as far as Royston and nearby Barley (both outside my radius and not on my list) mostly just so that my project would have visited Hertfordshire. The Hemingfords, where I had my friend Matthew along, and we had a very nice pub lunch. Girton and Wilburton, where I visited friends. Indeed, I stopped off in Wilburton on one or two other occasions. And, of course, Yelling, for which there were four of us, again with a nice lunch (in Eltisley). I had relatively little mechanical trouble. My worst ride for this was Exning: I got three punctures that day. Luckily the last one was close to home. I often would stop to take lots of photos en-route. My mum in particular appreciated all the pretty pictures. Rules I decided on these rules: I would cycle to each destination, in order, and it would count as collected if I rode both there and back. I allowed collecting multiple villages in the same outing, provided I did them in the right order. (And obviously I was allowed to pass through places out of order, without counting them.) I tried to get a picture of the village sign, where there was one. Failing that, I got a picture of something in the village with the village's name on it. I think the only one I didn't manage this for was Westley Bottom; I had to make do with the word "Westley" on some railway level crossing equipment. In Barway I had to make do with a planning application, stuck to a pole. I tried not to enter and leave a village by the same road, if possible.
Edge cases I had to make some decisions: I decided that I would consider the project complete if I visited everywhere whose centre was within my radius. But the centre of a settlement is rather hard to define. I needed a hard criterion for my OpenStreetMap data mining: a place counted if there was any node, way or relation, with the relevant place tag, any part of which was within my ambit. That included some places that probably oughtn't to have counted, but, fine. I also decided that I wouldn't visit suburbs of Cambridge, separately from Cambridge itself. I don't consider them separate settlements, at least, not if they're conurbated with Cambridge. So that excluded Trumpington, for example. But I decided that Girton and Fen Ditton were (just) separable. Although the place where I consider Girton and Cambridge to nearly touch is administratively well inside Girton, I chose to look at land use (on the ground, and in OSM data), rather than administrative boundaries. But I did visit both Histon and Impington, and each of the Shelfords, and Stapleford, as separate entries in my list. Mostly because otherwise I'd have to decide whether to skip (say) Impington, or Histon. Whereas skipping suburbs of Cambridge in favour of Cambridge itself was an easy decision, and it also got rid of a bunch of what would have been quite short, boring, urban expeditions. I sorted all the Greats and Littles under G and L, rather than (say) "Shelford, Great", which seemed like it would be cheating because then I would be able to do "Shelford, Great" and "Shelford, Little" in one go. Northstowe turned from mostly a building site into something that was arguably a settlement, during my project. It wasn't included in the output of my original data mining. Of course it's conurbated with Oakington - but happily, Northstowe inserts right before Oakington in the alphabetical list, so I decided to add it, visiting both the old and new in the same day. There are a bunch of other minor edge cases. Some villages have an outlying hamlet. Mostly I included these. There are some individual farms, which I generally didn't count. Some stats I visited 150 villages plus the Lords Bridge radio observatory. The project took 3 years and 3 months to complete. There were 96 rides, totalling about 4900km. So my mean distance was around 51km. The median distance per ride was a little higher, at around 52 km, and the median duration (including stoppages) was about 2h40. The total duration, if you add them all up, including stoppages, was about 275h, giving a mean speed including photo stops, lunches and all, of 18kph. The longest ride was 89.8km, collecting Scotland Farm, Shepreth, and Six Mile Bottom, so riding across the Cam valley. The shortest ride was 7.9km, collecting Cambridge (obviously); and I think that's the only one I did on my Brompton. The rest were all on my trusty Thorn Audax. My fastest ride (ranking by distance divided by time spent in motion) was to collect Haddenham, where I covered 46.3km in 1h39, giving an average speed in motion of 28.0kph. The most I collected in one day was 5 places: West Wickham, West Wratting, Westley Bottom, Westley Waterless, and Weston Colville. That was the day of the Wests. (There's only one East: East Hatley.) Map Here is a pretty picture of all of my tracklogs:
Edited 2023-08-25 01:32 BST to correct a slip.



8 August 2023

Matthew Garrett: Updating Fedora the unsupported way

I dug out a computer running Fedora 28, which was released 2018-04-01 - over 5 years ago. Backing up the data and re-installing seemed tedious, but the current version of Fedora is 38, and while Fedora supports updates from N to N+2 that was still going to be 5 separate upgrades. That seemed tedious, so I figured I'd just try to do an update from 28 directly to 38. This is, obviously, extremely unsupported, but what could possibly go wrong?

Running sudo dnf system-upgrade download --releasever=38 didn't successfully resolve dependencies, but sudo dnf system-upgrade download --releasever=38 --allowerasing passed and dnf started downloading 6GB of packages. And then promptly failed, since I didn't have any of the relevant signing keys. So I downloaded the fedora-gpg-keys package from F38 by hand and tried to install it, and got a signature hdr data: BAD, no. of bytes(88084) out of range error. It turns out that rpm doesn't handle cases where the signature header is larger than a few K, and RPMs from modern versions of Fedora exceed that limit. The obvious fix would be to install a newer version of rpm, but that wouldn't be easy without upgrading the rest of the system as well - or, alternatively, downloading a bunch of build depends and building it. Given that I'm already doing all of this in the worst way possible, let's do something different.

The relevant code in the hdrblobRead function of rpm's lib/header.c is:

int32_t il_max = HEADER_TAGS_MAX;
int32_t dl_max = HEADER_DATA_MAX;

if (regionTag == RPMTAG_HEADERSIGNATURES) {
    il_max = 32;
    dl_max = 8192;
}


which indicates that if the header in question is RPMTAG_HEADERSIGNATURES, it sets more restrictive limits on the size (no, I don't know why). So I installed rpm-libs-debuginfo, ran gdb against librpm.so.8, loaded the symbol file, and then did disassemble hdrblobRead. The relevant chunk ends up being:

0x000000000001bc81 <+81>: cmp $0x3e,%ebx
0x000000000001bc84 <+84>: mov $0xfffffff,%ecx
0x000000000001bc89 <+89>: mov $0x2000,%eax
0x000000000001bc8e <+94>: mov %r12,%rdi
0x000000000001bc91 <+97>: cmovne %ecx,%eax

which is basically "If ebx is not 0x3e, set eax to 0xfffffff - otherwise, set it to 0x2000". RPMTAG_HEADERSIGNATURES is 62, which is 0x3e, so I just opened librpm.so.8 in hexedit, went to byte 0x1bc81, and replaced 0x3e with 0xfe (an arbitrary invalid value). This has the effect of skipping the if (regionTag == RPMTAG_HEADERSIGNATURES) code and so using the default limits even if the header section in question is the signatures. And with that one byte modification, rpm from F28 would suddenly install the fedora-gpg-keys package from F38. Success!
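
A scripted equivalent of that hexedit step might look roughly like this. This is a sketch only: the 0x1bc81 address is specific to that particular librpm.so.8 build, the file path is a placeholder, and it assumes the usual case where the shared object's text segment maps file offset 0 at virtual address 0, so the disassembly address can be used as a file offset:

ADDRESS = 0x1bc81        # from the disassembly above; rediscover with gdb/objdump
PATH = "librpm.so.8"     # patch a copy, not your live system library

with open(PATH, "r+b") as f:
    f.seek(ADDRESS)
    window = f.read(8)
    idx = window.find(b"\x83\xfb\x3e")       # cmp $0x3e,%ebx
    assert idx >= 0, "instruction not found - wrong binary or offset"
    f.seek(ADDRESS + idx + 2)                # the 0x3e immediate
    f.write(b"\xfe")                         # anything that isn't RPMTAG_HEADERSIGNATURES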

But short-lived. dnf now believed packages had valid signatures, but sadly there were still issues. A bunch of packages in F38 had files that conflicted with packages in F28. These were largely Python 3 packages that conflicted with Python 2 packages from F28 - jumping this many releases meant that a bunch of explicit replaces and the like no longer existed. The easiest way to solve this was simply to uninstall python 2 before upgrading, avoiding the entire transition. Another issue was that some data files had moved from libxcrypt-common to libxcrypt, and removing libxcrypt-common would remove libxcrypt and a bunch of important things that depended on it (like, for instance, systemd). So I built a fake empty package that provided libxcrypt-common and removed the actual package. Surely everything would work now?

Ha no. The final obstacle was that several packages depended on rpmlib(CaretInVersions), and building another fake package that provided that didn't work. I shouted into the void and Bill Nottingham answered - rpmlib dependencies are synthesised by rpm itself, indicating that it has the ability to handle extensions that specific packages are making use of. This made things harder, since the list is hard-coded in the binary. But since I'm already committing crimes against humanity with a hex editor, why not go further? Back to editing librpm.so.8 and finding the list of rpmlib() dependencies it provides. There were a bunch, but I couldn't really extend the list. What I could do is overwrite existing entries. I tried this a few times but (unsurprisingly) broke other things since packages depended on the feature I'd overwritten. Finally, I rewrote rpmlib(ExplicitPackageProvide) to rpmlib(CaretInVersions) (adding an extra '\0' at the end of it to deal with it being shorter than the original string) and apparently nothing I wanted to install depended on rpmlib(ExplicitPackageProvide) because dnf finished its transaction checks and prompted me to reboot to perform the update. So, I did.

And about an hour later, it rebooted and gave me a whole bunch of errors due to the fact that dbus never got started. A bit of digging revealed that I had no /etc/systemd/system/dbus.service, a symlink that was presumably introduced at some point between F28 and F38 but which didn't get automatically added in my case because well who knows. That was literally the only thing I needed to fix up after the upgrade, and on the next reboot I was presented with a gdm prompt and had a fully functional F38 machine.

You should not do this. I should not do this. This was a terrible idea. Any situation where you're binary patching your package manager to get it to let you do something is obviously a bad situation. And with hindsight performing 5 independent upgrades might have been faster. But that would have just involved me typing the same thing 5 times, while this way I learned something. And what I learned is "Terrible ideas sometimes work and so you should definitely act upon them rather than doing the sensible thing", so like I said, you should not do this in case you learn the same lesson.


11 July 2023

Matthew Garrett: Roots of Trust are difficult

The phrase "Root of Trust" turns up at various points in discussions about verified boot and measured boot, and to a first approximation nobody is able to give you a coherent explanation of what it means[1]. The Trusted Computing Group has a fairly wordy definition, but (a) it's a lot of words and (b) I don't like it, so instead I'm going to start by defining a root of trust as "A thing that has to be trustworthy for anything else on your computer to be trustworthy".

(An aside: when I say "trustworthy", it is very easy to interpret this in a cynical manner and assume that "trust" means "trusted by someone I do not necessarily trust to act in my best interest". I want to be absolutely clear that when I say "trustworthy" I mean "trusted by the owner of the computer", and that as far as I'm concerned selling devices that do not allow the owner to define what's trusted is an extremely bad thing in the general case)

Let's take an example. In verified boot, a cryptographic signature of a component is verified before it's allowed to boot. A straightforward verified boot implementation has the firmware verify the signature on the bootloader or kernel before executing it. In this scenario, the firmware is the root of trust - it's the first thing that makes a determination about whether something should be allowed to run or not[2]. As long as the firmware behaves correctly, and as long as there aren't any vulnerabilities in our boot chain, we know that we booted an OS that was signed with a key we trust.
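
As a toy model (nothing like real firmware, which does this in C against keys baked into flash or fuses), the check being described is essentially the following; the key format, padding choice, and function names are all illustrative assumptions, not a real implementation.

    # Toy model of a verified boot step: check a signature over the next-stage
    # image against a key the "firmware" trusts, and refuse to run it otherwise.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import padding

    def boot(image: bytes) -> None:
        # Stand-in for handing control to the verified next stage.
        print(f"booting {len(image)}-byte image")

    def verify_and_boot(image: bytes, signature: bytes, trusted_key_pem: bytes) -> None:
        pubkey = serialization.load_pem_public_key(trusted_key_pem)
        try:
            pubkey.verify(signature, image, padding.PKCS1v15(), hashes.SHA256())
        except InvalidSignature:
            raise SystemExit("refusing to boot: signature does not verify against the trusted key")
        boot(image)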

But what guarantees that the firmware behaves correctly? What if someone replaces our firmware with firmware that trusts different keys, or hot-patches the OS as it's booting it? We can't just ask the firmware whether it's trustworthy - trustworthy firmware will say yes, but the thing about malicious firmware is that it can just lie to us (either directly, or by modifying the OS components it boots to lie instead). This is probably not sufficiently trustworthy!

Ok, so let's have the firmware be verified before it's executed. On Intel this is "Boot Guard", on AMD this is "Platform Secure Boot", everywhere else it's just "Secure Boot". Code on the CPU (either in ROM or signed with a key controlled by the CPU vendor) verifies the firmware[3] before executing it. Now the CPU itself is the root of trust, and, well, that seems reasonable - we have to place trust in the CPU, otherwise we can't actually do computing. We can now say with a reasonable degree of confidence (again, in the absence of vulnerabilities) that we booted an OS that we trusted. Hurrah!

Except. How do we know that the CPU actually did that verification? CPUs are generally manufactured without verification being enabled - different system vendors use different signing keys, so those keys can't be installed in the CPU at CPU manufacture time, and vendors need to do code development without signing everything so you can't require that keys be installed before a CPU will work. So, out of the box, a new CPU will boot anything without doing verification[4], and development units will frequently have no verification.

As a device owner, how do you tell whether or not your CPU has this verification enabled? Well, you could ask the CPU, but if you're doing that on a device that booted a compromised OS then maybe it's just hotpatching your OS so when you do that you just get RET_TRUST_ME_BRO even if the CPU is desperately waving its arms around trying to warn you it's a trap. This is, unfortunately, a problem that's basically impossible to solve using verified boot alone - if any component in the chain fails to enforce verification, the trust you're placing in the chain is misplaced and you are going to have a bad day.

So how do we solve it? The answer is that we can't simply ask the OS, we need a mechanism to query the root of trust itself. There's a few ways to do that, but fundamentally they depend on the ability of the root of trust to provide proof of what happened. This requires that the root of trust be able to sign (or cause to be signed) an "attestation" of the system state, a cryptographically verifiable representation of the security-critical configuration and code. The most common form of this is called "measured boot" or "trusted boot", and involves generating a "measurement" of each boot component or configuration (generally a cryptographic hash of it), and storing that measurement somewhere. The important thing is that it must not be possible for the running OS (or any pre-OS component) to arbitrarily modify these measurements, since otherwise a compromised environment could simply go back and rewrite history. One frequently used solution to this is to segregate the storage of the measurements (and the attestation of them) into a separate hardware component that can't be directly manipulated by the OS, such as a Trusted Platform Module. Each part of the boot chain measures relevant security configuration and the next component before executing it and sends that measurement to the TPM, and later the TPM can provide a signed attestation of the measurements it was given. So, an SoC that implements verified boot should create a measurement telling us whether verification is enabled - and, critically, should also create a measurement if it isn't. This is important because failing to measure the disabled state leaves us with the same problem as before; someone can replace the mutable firmware code with code that creates a fake measurement asserting that verified boot was enabled, and if we trust that we're going to have a bad time.
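
The core of this is the "extend" operation: each measurement is folded into a running hash, so a later component can add to the record but can never rewrite what came before it. Here is a minimal model of that behaviour; real TPMs keep the running value in hardware registers (PCRs) and use a rather more structured event format, so the event names below are purely illustrative.

    # Minimal model of a PCR: extend() is one-way, so the final value commits
    # to every measurement and the order in which they arrived.
    import hashlib

    def extend(pcr: bytes, measurement: bytes) -> bytes:
        return hashlib.sha256(pcr + measurement).digest()

    pcr = bytes(32)  # PCRs start out as all zeroes
    for event in (b"verified boot policy: enabled",
                  b"trusted signing key: <digest of key>",
                  b"bootloader image",
                  b"kernel image"):
        pcr = extend(pcr, hashlib.sha256(event).digest())

    print(pcr.hex())  # the value a TPM quote would later attest to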

(Of course, simply measuring the fact that verified boot was enabled isn't enough - what if someone replaces the CPU with one that has verified boot enabled, but trusts keys under their control? We also need to measure the keys that were used in order to ensure that the device trusted only the keys we expected, otherwise again we're going to have a bad time)

So, an effective root of trust needs to:

1) Create a measurement of its verified boot policy before running any mutable code
2) Include the trusted signing key in that measurement
3) Actually perform that verification before executing any mutable code

and from then on we're in the hands of the verified code actually being trustworthy, and it's probably written in C so that's almost certainly false, but let's not try to solve every problem today.
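
On the verifying side, consuming such an attestation looks roughly like this: replay the event log, check that it matches the value the root of trust signed, and then check that the log says verification was enabled with the key you expected. This is a hedged sketch only - the log format and event names are invented, a real verifier parses a TPM event log and verifies a signed quote, and requirement 3 can't be checked remotely at all; it's a property you have to trust the root of trust to actually have.

    # Sketch of a remote verifier checking an attested event log against the
    # requirements above. Log format and event names are invented for illustration.
    import hashlib

    def replay(events: list[tuple[str, bytes]]) -> bytes:
        pcr = bytes(32)
        for _description, digest in events:
            pcr = hashlib.sha256(pcr + digest).digest()
        return pcr

    def acceptable(events, attested_pcr: bytes, expected_key_digest: bytes) -> bool:
        if replay(events) != attested_pcr:
            return False  # log doesn't match what the root of trust attested
        descriptions = [description for description, _ in events]
        return ("verified boot policy: enabled" in descriptions
                and ("trusted signing key", expected_key_digest) in events)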

Does anything do this today? As far as I can tell, Intel's Boot Guard implementation does. Based on publicly available documentation I can't find any evidence that AMD's Platform Secure Boot does (it does the verification, but it doesn't measure the policy beforehand, so it seems spoofable), but I could be wrong there. I haven't found any general purpose non-x86 parts that do, but this is in the realm of things that SoC vendors seem to believe is some sort of value-add that can only be documented under NDAs, so please do prove me wrong. And then there are add-on solutions like Titan, where we delegate the initial measurement and validation to a separate piece of hardware that measures the firmware as the CPU reads it, rather than requiring that the CPU do it.

But, overall, the situation isn't great. On many platforms there's simply no way to prove that you booted the code you expected to boot. People have designed elaborate security implementations that can be bypassed in a number of ways.

[1] In this respect it is extremely similar to "Zero Trust"
[2] This is a bit of an oversimplification - once we get into dynamic roots of trust like Intel's TXT this story gets more complicated, but let's stick to the simple case today
[3] I'm kind of using "firmware" in an x86ish manner here, so for embedded devices just think of "firmware" as "the first code executed out of flash and signed by someone other than the SoC vendor"
[4] In the Intel case this isn't strictly true, since the keys are stored in the motherboard chipset rather than the CPU, and so taking a board with Boot Guard enabled and swapping out the CPU won't disable Boot Guard because the CPU reads the configuration from the chipset. But many mobile Intel parts have the chipset in the same package as the CPU, so in theory swapping out that entire package would disable Boot Guard. I am not good enough at soldering to demonstrate that.

12 June 2023

Matthew Palmer: Private Key Redaction: Redux

[Note: the original version of this post named the author of the referenced blog post, and the tone of my writing could be construed to be mocking or otherwise belittling them. While that was not my intention, I recognise that was a possible interpretation, and I have revised this post to remove identifying information and try to neutralise the tone. On the other hand, I have kept the identifying details of the domain involved, as there are entirely legitimate security concerns that result from the issues discussed in this post.] I have spoken before about why it is tricky to redact private keys. Although that post demonstrated a real-world, presumably-used-in-the-wild private key, I've been made aware of commentary along the lines of this representative sample:
I find it hard to believe that anyone would take their actual production key and redact it for documentation. Does the author have evidence of this in practice, or did they see example keys and assume they were redacted production keys?
Well, buckle up, because today's post is another real-world case study, with rather higher stakes than the previous example.

When Helping Hurts

Today's case study begins with someone who attempted to do a very good thing: they wrote a blog post about using HashiCorp Vault to store certificates and their private keys. In the post, they included some test data, a certificate and a private key, which they redacted. Unfortunately, they did not redact these very well. Each base64 blob has had one line replaced with all xs. Based on the steps I explained previously, it is relatively straightforward to retrieve the entire, intact private key.
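
The precise reconstruction depends on which line of the base64 was blanked out, and the earlier post walks through it; the short version of why it works is that an RSA private key is full of redundancy. A toy illustration of that redundancy (tiny numbers, obviously nothing like a real key):

    # Toy illustration of RSA key redundancy: given the public modulus n, the
    # public exponent e, and a single recovered prime, every other private
    # component can be recomputed.
    n, e = 3233, 17        # toy modulus (61 * 53) and public exponent
    p = 61                 # recovering one prime is enough...
    q = n // p             # ...because the other is just n divided by it
    assert p * q == n

    d = pow(e, -1, (p - 1) * (q - 1))    # private exponent
    dp, dq = d % (p - 1), d % (q - 1)    # CRT exponents stored in PKCS#1 keys
    qinv = pow(q, -1, p)                 # CRT coefficient

    message = 42
    assert pow(pow(message, e, n), d, n) == message  # the rebuilt key round-trips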

From Bad to OMFG

Now, if this post author had, say, generated a fresh private key (after all, there's no shortage of possible keys), that would not be worthy of a blog post. As you may surmise, that is not what happened. After reconstructing the insufficiently-redacted private key, you end up with a key that has a SHA256 fingerprint (in hex) of:

72bef096997ec59a671d540d75bd1926363b2097eb9fe10220b2654b1f665b54

Searching for certificates which use that key fingerprint, we find one result: a certificate for hiltonhotels.jp (and a bunch of other, related, domains, as subjectAltNames). As of the time of writing, that certificate is not marked as revoked, and appears to be the same certificate that is currently presented to visitors of that site.

This is, shall we say, not great. Anyone in possession of this private key (which, I should emphasise, has presumably been public information since the post's publication date of February 2023) has the ability to completely transparently impersonate the sites listed in that certificate. That would provide an attacker with the ability to capture any data a user entered, such as personal information, passwords, or payment details, and also modify what the user's browser received, including injecting malware or other unpleasantness. In short, no good deed goes unpunished, and this attempt to educate the world at large about the benefits of secure key storage has instead published private key material. Remember, kids: friends don't let friends post redacted private keys to the Internet.
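
As a postscript, for anyone wanting to check one of their own keys against a fingerprint like the one above: I believe the value quoted is the SHA-256 of the DER-encoded public key (SubjectPublicKeyInfo), the form that certificate search services such as crt.sh index. That detail, and the file name below, are assumptions on my part.

    # Compute the SHA-256 fingerprint of a key's public half, in the
    # DER-encoded SubjectPublicKeyInfo form indexed by certificate search
    # services such as crt.sh. The file name is illustrative.
    import hashlib
    from cryptography.hazmat.primitives import serialization

    with open("recovered_key.pem", "rb") as f:
        key = serialization.load_pem_private_key(f.read(), password=None)

    spki = key.public_key().public_bytes(
        encoding=serialization.Encoding.DER,
        format=serialization.PublicFormat.SubjectPublicKeyInfo,
    )
    print(hashlib.sha256(spki).hexdigest())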
