Matthieu Caneill: Debsources, python3, and funky file names
Rumor has it that python2 is not a thing anymore.
Well, I'm certainly late to the party, but I'm happy to report that
sources.debian.org is now running python3.
Wait, it wasn't?
Back when development started, python3 was very much a real language, but it was
hard to adopt because it was not supported by many libraries. So python2 was
chosen, meaning print-based debugging was used in lieu of print()-based
debugging, and str were bytes, not unicode.
And things were working just fine. One day python2 EOL was announced, with a
date far in the future. Far enough to procrastinate for a long time. Combine
this with a codebase that is stable enough to not see many commits, and the fact
that Debsources is a volunteer-based project that happens at best on week-ends,
and you end up with dormant software and a missed deadline.
But, as dormant as the codebase is, the instance hosted at
sources.debian.org is very popular and gets 200k
to 500k hits per day. More than enough to deserve proper maintenance and a
transition to python3.
Funky file names
While transitioning to python3 and juggling left and right with str, bytes and
unicode for internal objects, files, database entries and HTTP content, I
stumbled upon a bug that has been there since day 1.
Quick recap if you're unfamiliar with this tool: Debsources displays the content
of the source packages in the Debian archive. In other words, it's a bit like
GitHub, but for the Debian source code.
And some pieces of software out there, that ended up in Debian packages, happen
to contain files whose names can't be decoded to UTF-8. Interestingly enough,
there's no such thing as a standard for file names: with a few exceptions that
vary by operating system, any sequence of bytes can be a legit file name. And
some sequences of bytes are not valid UTF-8.
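To make this concrete, here is a minimal sketch of how one might check whether a file name is valid UTF-8. The helper name is_valid_utf8 and the sample names are illustrative assumptions, not something taken from Debsources.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


plain = b"cowsay"
# 0xED announces a 3-byte UTF-8 sequence, but "s" is not a continuation
# byte, so this name is a legit file name and yet not valid UTF-8.
funky = b"\xedslenska.alias"

print(is_valid_utf8(plain))
print(is_valid_utf8(funky))
```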
Of course those files are rare, and using ASCII characters to name a file is a
much more common practice than using bytes in a non-UTF-8 character
encoding. But when you deal with almost 100 million files over which you have no
control (those files come from free software projects, and make their way into
Debian without any renaming), it happens.
Now back to the bug: when trying to display such a file through the web
interface, it would crash because the file name couldn't be decoded as UTF-8,
which is needed for the HTML representation of the page.
Bugfix
An often valid approach when trying to represent invalid UTF-8 content is to
ignore errors, and replace them with ?
or
. This is what Debsources
actually does to display non-UTF-8 file content.
Unfortunately, this best-effort approach is not suitable for file names, as file
names are also identifiers in Debsources: among other places, they are part of
URLs. If a URL were to use placeholder characters to replace those bytes, there
would be no deterministic way to match it with a file on disk anymore.
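The loss of information is easy to reproduce. In the sketch below, two different byte strings (made-up file names, assumed here to be ISO-8859-1 encoded) end up with the same displayed text once the undecodable bytes are replaced, so the text alone can no longer tell us which file was meant. The helper name lossy_display is hypothetical.

```python
def lossy_display(raw: bytes) -> str:
    """Best-effort decoding: undecodable bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


name_a = b"caf\xe9.txt"  # hypothetical name, 0xE9 is not valid UTF-8 here
name_b = b"caf\xe8.txt"  # a different hypothetical name, 0xE8 instead

# Fine for showing text to a human...
print(lossy_display(name_a))
# ...but two distinct names now collide, so this can't serve as an identifier.
print(lossy_display(name_a) == lossy_display(name_b))
```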
Representing binary data as text is a known problem. Multiple
lossless solutions exist, such as base64 and its variants, but URLs looking like
https://sources.debian.org/src/Y293c2F5LzMuMDMtOS4yL2Nvd3NheS8= are not
readable at all compared to
https://sources.debian.org/src/cowsay/3.03-9.2/cowsay/. Plus, they would not be
backwards-compatible with all the existing links.
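For reference, the unreadable URL above is simply the path of the readable one run through base64. A short sketch with Python's standard base64 module shows that the encoding is lossless but opaque:

```python
import base64

path = b"cowsay/3.03-9.2/cowsay/"

# Lossless, and safe for any byte sequence...
token = base64.urlsafe_b64encode(path).decode("ascii")
print(token)

# ...and fully reversible, but a human can't read the package name anymore.
print(base64.urlsafe_b64decode(token) == path)
```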
The solution I chose is to use double-percent encoding: this allows the
representation of any byte in a URL, while keeping allowed characters unchanged
- and preventing CGI gateways
from trying to decode non-UTF-8 bytes. This is the best of both worlds: regular
file names get to appear normally and are human-readable, and funky file names
only have percent signs and hex numbers where needed.
Here is an example of such a URL:
https://sources.debian.org/src/aspell-is/0.51-0-4/%25EDslenska.alias/. Notice
the %25ED: it represents the percent sign itself (%25) followed by the hex value
of a byte that is not valid UTF-8 in this name (ED, which reads as %ED after one
round of decoding).
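The idea can be sketched in a few lines of Python. This is a rough illustration under my own assumptions, not the actual Debsources code: the function names escape_name, url_for and name_from_url are hypothetical, and the first layer escapes only literal percent signs and bytes that are not valid UTF-8, so that regular names stay untouched.

```python
from urllib.parse import quote, unquote, unquote_to_bytes


def escape_name(raw: bytes) -> str:
    """First layer: printable text where only '%' and undecodable bytes
    are written as %XX."""
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # surrogateescape maps an undecodable byte 0xNN to U+DCNN
            out.append("%{:02X}".format(code - 0xDC00))
        elif ch == "%":
            # escape literal percent signs so that decoding stays unambiguous
            out.append("%25")
        else:
            out.append(ch)
    return "".join(out)


def url_for(raw: bytes) -> str:
    """Second layer: regular URL quoting, turning each '%' into '%25'."""
    return quote(escape_name(raw), safe="/")


def name_from_url(url_path: str) -> bytes:
    """Undo both layers and get the original bytes back."""
    return unquote_to_bytes(unquote(url_path))


print(url_for(b"cowsay/3.03-9.2/cowsay/"))
print(url_for(b"aspell-is/0.51-0-4/\xedslenska.alias/"))
```

Because the second layer hides the inner percent signs from anything that decodes the URL once, a gateway or framework doing that first decoding only ever sees plain ASCII like %ED, never a raw non-UTF-8 byte.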
Transitioning to this was quite a challenge, as those file names don't only
appear in URLs, but also in web pages themselves, log files, database tables,
etc. And everything was done with str: that made sense in python2, when str
were bytes, but not so much in python3.
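To illustrate why this matters in python3: text and bytes no longer mix implicitly, so every place where a file name crosses a boundary needs an explicit conversion. The sketch below shows the failure, plus one possible lossless conversion based on the surrogateescape error handler. The helpers to_text and to_bytes are hypothetical names, and this is only one way to do it, not necessarily what Debsources does.

```python
raw_name = b"\xedslenska.alias"

# In python3, gluing bytes onto a str is an error rather than a silent success.
try:
    label = "/src/" + raw_name
except TypeError as exc:
    mixing_error = str(exc)
    print("cannot mix:", mixing_error)


def to_text(raw: bytes) -> str:
    """bytes -> str without losing undecodable bytes."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """str -> bytes, restoring the bytes kept aside by to_text()."""
    return text.encode("utf-8", errors="surrogateescape")


# The round trip is lossless, but the intermediate str contains lone
# surrogates, so it can't be strictly encoded as UTF-8 for display.
print(to_bytes(to_text(raw_name)) == raw_name)
```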
What are those files? What's their network?
I was wondering too. Let's list them!
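import os

# Walk the tree with a bytes path so that os.walk() yields bytes names,
# which lets us test each one for UTF-8 validity ourselves.
with open('non-utf-8-paths.bin', 'wb') as f:
    for root, folders, files in os.walk(b'/srv/sources.debian.org/sources/'):
        for path in folders + files:
            try:
                path.decode('utf-8')
            except UnicodeDecodeError:
                # Not valid UTF-8: record the full path, one per line.
                f.write(root + b'/' + path + b'\n')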
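Running this on the Debsources main instance, which hosts pretty much all Debian
packages that were part of a Debian release, I could find 307 files (among a
total of almost 100 million files).
Without looking deep into them, they seem to fall into 2 categories:
- File names that are not valid UTF-8, but are valid in a different charset. Not all software is developed in English or on UTF-8 systems.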
- File names that can't be decoded to UTF-8 on purpose, to be used as input to test suites, and assert resilience of the software to non-UTF-8 data.
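A name from the first category becomes readable once decoded with the right legacy charset. Here is a small sketch of that idea, with a hypothetical helper guess_decoding. The choice of ISO-8859-1 as the fallback is purely an assumption for illustration; since that charset accepts every byte, a successful decode proves nothing about what the author actually used.

```python
def guess_decoding(raw: bytes, candidates=("utf-8", "iso-8859-1")):
    """Return (encoding, text) for the first candidate that decodes raw,
    or (None, None) if none of them does."""
    for encoding in candidates:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None, None


print(guess_decoding(b"cowsay"))
print(guess_decoding(b"\xedslenska.alias"))
```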