Search Results: "Matthieu Caneill"

23 January 2022

Matthieu Caneill: Debsources, python3, and funky file names

Rumors are running that python2 is not a thing anymore. Well, I'm certainly late to the party, but I'm happy to report that sources.debian.org is now running python3. Wait, it wasn't? Back when development started, python3 was very much a real language, but it was hard to adopt because it was not supported by many libraries. So python2 was chosen, meaning print-based debugging was used in lieu of print()-based debugging, and str were bytes, not unicode. And things were working just fine. One day python2 EOL was announced, with a date far in the future. Far enough to procrastinate for a long time. Combine this with a codebase that is stable enough to not see many commits, and the fact that Debsources is a volunteer-based project that happens at best on week-ends, and you end up with a dormant software and a missed deadline. But, as dormant as the codebase is, the instance hosted at sources.debian.org is very popular and gets 200k to 500k hits per day. Largely enough to be worth a proper maintenance and a transition to python3. Funky file names While transitioning to python3 and juggling left and right with str, bytes and unicode for internal objects, files, database entries and HTTP content, I stumbled upon a bug that has been there since day 1. Quick recap if you're unfamiliar with this tool: Debsources displays the content of the source packages in the Debian archive. In other words, it's a bit like GitHub, but for the Debian source code. And some pieces of software out there, that ended up in Debian packages, happen to contain files whose names can't be decoded to UTF-8. Interestingly enough, there's no such thing as a standard for file names: with a few exceptions that vary by operating system, any sequence of bytes can be a legit file name. And some sequences of bytes are not valid UTF-8. Of course those files are rare, and using ASCII characters to name a file is a much more common practice than using bytes in a non-UTF-8 character encoding. 
But when you deal with almost 100 million files on which you have no control (those files come from free software projects, and make their way into Debian without any renaming), it happens. Now back to the bug: when trying to display such a file through the web interface, it would crash because it can't convert the file name to UTF-8, which is needed for the HTML representation of the page. Bugfix An often valid approach when trying to represent invalid UTF-8 content is to ignore errors, and replace them with ? or . This is what Debsources actually does to display non-UTF-8 file content. Unfortunately, this best-effort approach is not suitable for file names, as file names are also identifiers in Debsources: among other places, they are part of URLs. If an URL were to use placeholder characters to replace those bytes, there would be no deterministic way to match it with a file on disk anymore. The representation of binary data into text is a known problem. Multiple lossless solutions exist, such as base64 and its variants, but URLs looking like https://sources.debian.org/src/Y293c2F5LzMuMDMtOS4yL2Nvd3NheS8= are not readable at all compared to https://sources.debian.org/src/cowsay/3.03-9.2/cowsay/. Plus, not backwards-compatible with all existing links. The solution I chose is to use double-percent encoding: this allows the representation of any byte in an URL, while keeping allowed characters unchanged - and preventing CGI gateways from trying to decode non-UTF-8 bytes. This is the best of both worlds: regular file names get to appear normally and are human-readable, and funky file names only have percent signs and hex numbers where needed. Here is an example of such an URL: https://sources.debian.org/src/aspell-is/0.51-0-4/%25EDslenska.alias/. Notice the %25ED to represent the percentage symbol itself (%25) followed by an invalid UTF-8 byte (%ED). 
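As a rough illustration of the two approaches discussed above, here is a minimal Python sketch. The function names are made up for this example, and it is an assumption that applying urllib.parse.quote twice resembles what Debsources does internally; the post only describes the resulting URL shape.

```python
from urllib.parse import quote, unquote_to_bytes


def lossy_display(raw: bytes) -> str:
    """Best-effort text: undecodable bytes become U+FFFD (not reversible)."""
    return raw.decode("utf-8", errors="replace")


def to_url_component(raw: bytes) -> str:
    """Hypothetical helper: percent-encode twice, so '%ED' becomes '%25ED'."""
    once = quote(raw, safe="")    # unreserved ASCII stays, other bytes -> %XX
    return quote(once, safe="")   # the '%' signs themselves -> %25


def from_url_component(component: str) -> bytes:
    """Reverse of to_url_component: decode twice to get the raw bytes back."""
    return unquote_to_bytes(unquote_to_bytes(component))


funky = b"\xedslenska.alias"
print(to_url_component(funky))      # %25EDslenska.alias
print(to_url_component(b"cowsay"))  # cowsay
```

The lossy version cannot be mapped back to a file on disk, while the double-encoded one round-trips to the exact original bytes.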
Transitioning to this was quite a challenge, as those file names don't only appear in URLs, but also in web pages themselves, log files, database tables, etc. And everything was done with str: that made sense in python2 when str were bytes, but not so much in python3.

What are those files?

What's their network? I was wondering too. Let's list them!

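import os

# Walk the tree in bytes mode so that undecodable names are returned as-is.
with open('non-utf-8-paths.bin', 'wb') as f:
    for root, folders, files in os.walk(b'/srv/sources.debian.org/sources/'):
        for name in folders + files:
            try:
                name.decode('utf-8')
            except UnicodeDecodeError:
                # Record the full path of every name that is not valid UTF-8.
                f.write(os.path.join(root, name) + b'\n')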

Running this on the Debsources main instance, which hosts pretty much all Debian packages that were part of a Debian release, I could find 307 files (among a total of almost 100 million files). Without looking deep into them, they seem to fall into 2 categories:

File names that are not valid UTF-8, but are valid in a different charset. Not all software is developed in English or on UTF-8 systems.
File names that can't be decoded to UTF-8 on purpose, to be used as input to test suites, and assert resilience of the software to non-UTF-8 data.

That last point hits home, as it was clearly lacking in Debsources. A funky file name is now part of its test suite. ;)
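To make the first category concrete, the sketch below decodes the aspell-is file name from the URL quoted earlier. It assumes the name was written in ISO-8859-1, which the post does not state; the helper name describe is also invented for this example.

```python
def describe(name: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode as UTF-8 when possible, else fall back to a guessed charset."""
    try:
        return name.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never fails;
        # success here does not prove the guess is the right charset.
        return name.decode(fallback)


print(describe(b"cowsay"))
print(describe(b"\xedslenska.alias") == "\u00edslenska.alias")  # True
```

Under that assumption, the byte 0xED becomes the letter í, which fits an Icelandic dictionary package.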

18 November 2017

Matthieu Caneill: MiniDebconf in Toulouse

I attended the MiniDebconf in Toulouse, which was hosted in the larger Capitole du Libre, a free software event with talks, presentation of associations, and a keysigning party. I didn't expect the event to be that big, and I was very impressed by its organization. Cheers to all the volunteers, it has been an amazing weekend! Here's a summary of the talks I attended.

Du logiciel libre à la monnaie libre
Speaker: Éloïs
The first talk I attended was, translated to English, "from free software to free money". Éloïs compared the 4 freedoms of free software with money, and what properties money needs to exhibit in order to be considered free. He then introduced Ğ1, a project of free (as in free speech!) money, started in the region around Toulouse. Contrary to some distributed ledgers such as Bitcoin, Ğ1 isn't based on a hash-based proof-of-work, but rather on a web of trust of people certifying each other, hence limiting the energy consumption required by the network to function.

YunoHost
Speaker: Jimmy Monin
I then attended a presentation of YunoHost. Being a happy user myself, it was very nice to discover the future expected features, and also meet two of the developers. YunoHost is a Debian-based project, aimed at providing all the tools necessary to self-host applications, including email, website, calendar, development tools, and dozens of other packages.

Premiers pas dans l'univers de Debian
Speaker: Nicolas Dandrimont
For the first talk of the MiniDebConf, Nicolas Dandrimont introduced Debian, its philosophy, and how it works with regard to upstreams and downstreams. He gave many details on the teams, the infrastructure, and the internals of Debian.

Trusting your computer and system
Speaker: Jonas Smedegaard
Jonas introduced some security concepts, and how they are abused and often meaningless (to quote his own words, "secure is bullshit"). He described a few projects which lean towards more secure and open hardware, for both phones and laptops.

Automatiser la gestion de configuration de Debian avec Ansible
Speaker: Jérémy Lecour
Jérémy, from Evolix, introduced Ansible, and how they use it to manage hundreds of Debian servers. Ansible is a very powerful tool, and a huge ecosystem, in many ways similar to Puppet or Chef, except it is agent-less, using only ssh connections to communicate with remote machines. Very nice to compare their use of Ansible with mine, since that's the software I use at work for deploying experiments.

Making Debian for everybody
Speaker: Samuel Thibault
Samuel gave a talk about accessibility, and the general availability of the tools in today's operating systems, including Debian. The lesson to take home is that we often don't do enough in this domain, particularly when considering some issues people might have that we don't always think about. Accessibility on computers (and elsewhere) should be the default, and never require complex setups.

Retour d'expérience : mise à jour de milliers de terminaux Debian
Speaker: Cyril Brulebois
Cyril described a problem he was hired for, an update of thousands of Debian servers from wheezy to jessie, which he discovered afterwards was worse than initially thought, since the machines were running the out-of-date squeeze. Since they were not always administered with the best sysadmin practices, they all exhibited different configurations and different package lists, which raised many issues and gave him interesting challenges. They were solved using Ansible, which also had the effect of standardizing their system administration practices.

Retour d'expérience : utilisation de Debian chez Evolix
Speaker: Grégory Colpart
Grégory described Evolix, a company which manages servers for their clients, and how they were inspired by Debian, for both their internal tools and their practices. It is very interesting to see that some of the Debian values can be easily exported for a more open and collaborative business.

Lightning talks
To close the conference, two lightning talks were presented, describing the switch from Windows XP to Debian in an environmental association near Toulouse; and how snapshot.debian.org can be used with bisections to find the source of some regressions.

Conclusion
A big thank you to all the organizers and the associations who contributed to make this event a success. Cheers!

23 August 2017

Antoine Beaupré: The supposed decline of copyleft

At DebConf17, John Sullivan, the executive director of the FSF, gave a talk on the supposed decline of the use of copyleft licenses in free-software projects. In his presentation, Sullivan questioned the notion that permissive licenses, like the BSD or MIT licenses, are gaining ground at the expense of the traditionally dominant copyleft licenses from the FSF. While there does seem to be a rise in the use of permissive licenses, in general, there are several possible explanations for the phenomenon.

When the rumor mill starts

Sullivan gave a recent example of the claim of the decline of copyleft in an article on Opensource.com by Jono Bacon from February 2017 that showed a histogram of license usage between 2010 and 2017 (seen below).

From that, Bacon elaborates possible reasons for the apparent decline of the GPL. The graphic used in the article was actually generated by Stephen O'Grady in a January article, The State Of Open Source Licensing, which said:
In Black Duck's sample, the most popular variant of the GPL version 2 is less than half as popular as it was (46% to 19%). Over the same span, the permissive MIT has gone from 8% share to 29%, while its permissive cousin the Apache License 2.0 jumped from 5% to 15%.
Sullivan, however, argued that the methodology used to create both articles was problematic. Neither contains original research: the graphs actually come from the Black Duck Software "KnowledgeBase" data, which was partly created from the old Ohloh web site now known as Open Hub. To show one problem with the data, Sullivan mentioned two free-software projects, GNU Bash and GNU Emacs, that had been showcased on the front page of Ohloh.net in 2012. On the site, Bash was (and still is) listed as GPLv2+, whereas it changed to GPLv3 in 2011. He also claimed that "Emacs was listed as licensed under GPLv3-only, which is a license Emacs has never had in its history", although I wasn't able to verify that information from the Internet archive. Basically, according to Sullivan, "the two projects featured on the front page of a site that was using [the Black Duck] data set were wrong". This, in turn, seriously brings into question the quality of the data:
I reported this problem and we'll continue to do that but when someone is not sharing the data set that they're using for other people to evaluate it and we see glimpses of it which are incorrect, that should give us a lot of hesitation about accepting any conclusion that comes out of it.
Reproducible observations are necessary to the establishment of solid theories in science. Sullivan didn't try to contact Black Duck to get access to the database, because he assumed (rightly, as it turned out) that he would need to "pay for the data under terms that forbid you to share that information with anybody else". So I wrote Black Duck myself to confirm this information. In an email interview, Patrick Carey from Black Duck confirmed its data set is proprietary. He believes, however, that through a "combination of human and automated techniques", Black Duck is "highly confident at the accuracy and completeness of the data in the KnowledgeBase". He did point out, however, that "the way we track the data may not necessarily be optimal for answering the question on license use trend" as "that would entail examination of new open source projects coming into existence each year and the licenses used by them". In other words, even according to Black Duck, its database may not be useful to establish the conclusions drawn by those articles. Carey did agree with those conclusions intuitively, however, saying that "there seems to be a shift toward Apache and MIT licenses in new projects, though I don't have data to back that up". He suggested that "an effective way to answer the trend question would be to analyze the new projects on GitHub over the last 5-10 years." Carey also suggested that "GitHub has become so dominant over the recent years that just looking at projects on GitHub would give you a reasonable sampling from which to draw conclusions".

Indeed, GitHub published a report in 2015 that also seems to confirm MIT's popularity (45%), surpassing copyleft licenses (24%). The data is, however, not without its own limitations. For example, in the above graph going back to the inception of GitHub in 2008, we see a rather abnormal spike in 2013, which seems to correlate with the launch of the choosealicense.com site, described by GitHub as "our first pass at making open source licensing on GitHub easier". In his talk, Sullivan was critical of the initial version of the site, which he described as biased toward permissive licenses. Because the GitHub project creation page links to the site, Sullivan explained that the site's bias could have actually influenced GitHub users' license choices. Following a talk from Sullivan at FOSDEM 2016, GitHub addressed the problem later that year by rewording parts of the front page to be more accurate, but any resulting change in license choice obviously doesn't show in the report produced in 2015 and won't affect choices users have already made. Therefore, there are reasonable doubts that GitHub's subset of software projects is actually representative of the larger free-software community.

In search of solid evidence

So it seems we are missing good, reproducible results to confirm or dispel these claims. Sullivan explained that it is a difficult problem, if only in the way you select which projects to analyze: the impact of an MIT-licensed personal wiki will obviously be vastly different from, say, a GPL-licensed C compiler or kernel. We may want to distinguish between active and inactive projects. Then there is the problem of code duplication, both across publication platforms (a project may be published on GitHub and SourceForge for example) and across projects (code may be copy-pasted between projects). We should think about how to evaluate the license of a given project: different files in the same code base regularly have different licenses, and often none at all. This is why having a clear, documented and publicly available data set and methodology is critical. Without this, the assumptions made are not clear and it is unreasonable to draw certain conclusions from the results. It turns out that some researchers did that kind of open research in 2016 in a paper called "The Debsources Dataset: Two Decades of Free and Open Source Software" [PDF] by Matthieu Caneill, Daniel M. Germán, and Stefano Zacchiroli. The Debsources data set is the complete Debian source code that covers a large history of the Debian project and therefore includes thousands of free-software projects of different origins. According to the paper:
The long history of Debian creates a perfect subject to evaluate how FOSS licenses use has evolved over time, and the popularity of licenses currently in use.
Sullivan argued that the Debsources data set is interesting because of its quality: every package in Debian has been reviewed by multiple humans, including the original packager, but also by the FTP masters to ensure that the distribution can legally redistribute the software. The existence of a package in Debian provides a minimal "proof of use": unmaintained packages get removed from Debian on a regular basis and the mere fact that a piece of software gets packaged in Debian means at least some users found it important enough to work on packaging it. Debian packagers make specific efforts to avoid code duplication between packages in order to ease security maintenance. The data set covers a period longer than Black Duck's or GitHub's, as it goes all the way back to the Hamm 2.0 release in 1998. The data and how to reproduce it are freely available under a CC BY-SA 4.0 license.

Sullivan presented the above graph from the research paper that showed the evolution of software license use in the Debian archive. Whereas previous graphs showed statistics in percentages, this one showed actual absolute numbers, where we can't actually distinguish a decline in copyleft licenses. To quote the paper again:
The top license is, once again, GPL-2.0+, followed by: Artistic-1.0/GPL dual-licensing (the licensing choice of Perl and most Perl libraries), GPL-3.0+, and Apache-2.0.
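The difference between percentages and absolute numbers can be checked with a few lines of Python. The counts below are invented purely for illustration and are not taken from the Debsources paper, Black Duck, or GitHub; the function name shares is likewise hypothetical.

```python
# Hypothetical package counts, made up for this example only.
counts = {
    2010: {"copyleft": 600, "permissive": 400},
    2017: {"copyleft": 900, "permissive": 1500},
}


def shares(year_counts: dict) -> dict:
    """Convert absolute counts for one year into percentage shares."""
    total = sum(year_counts.values())
    return {name: round(100 * n / total, 1) for name, n in year_counts.items()}


for year, year_counts in counts.items():
    print(year, year_counts, shares(year_counts))
```

With these made-up numbers the copyleft count grows from 600 to 900 while its share drops from 60.0% to 37.5%, so a falling percentage alone does not show a decline in use.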
Indeed, looking at the graph, at most we see a rise of the Apache and MIT licenses and no decline of the GPL per se, although its adoption does seem to slow down in recent years. We should also mention the possibility that Debian's data set has the opposite bias: toward GPL software. The Debian project is culturally quite different from the GitHub community and even the larger free-software ecosystem, naturally, which could explain the disparity in the results. We can only hope a similar analysis can be performed on the much larger Software Heritage data set eventually, which may give more representative results. The paper acknowledges this problem:
Debian is likely representative of enterprise use of FOSS as a base operating system, where stable, long-term and seldomly updated software products are desirable. Conversely Debian is unlikely representative of more dynamic FOSS environments (e.g., modern Web-development with micro libraries) where users, who are usually developers themselves, expect to receive library updates on a daily basis.
The Debsources research also shares methodology limitations with Black Duck: while Debian packages are reviewed before uploading and we can rely on the copyright information provided by Debian maintainers, the research also relies on automated tools (specifically FOSSology) to retrieve license information. Sullivan also warned against "ascribing reason to numbers": people may have different reasons for choosing a particular license. Developers may choose the MIT license because it has fewer words, for compatibility reasons, or simply because "their lawyers told them to". It may not imply an actual deliberate philosophical or ideological choice. Finally, he brought up the theory that the rise of non-copyleft licenses isn't necessarily to the detriment of the GPL. He explained that, even if there is an actual decline, it may not be much of a problem if there is an overall growth of free software to the detriment of proprietary software. He reminded the audience that non-copyleft licenses are still free software, according to the FSF and the Debian Free Software Guidelines, so their rise is still a positive outcome. Even if the GPL is a better tool to accomplish the goal of a free-software world, we can all acknowledge that the conversion of proprietary software to more permissive and certainly simpler licenses is definitely heading in the right direction.
[I would like to thank the DebConf organizers for providing meals for me during the conference.] Note: this article first appeared in the Linux Weekly News.

22 October 2016

Matthieu Caneill: Debugging 101

While teaching a class on concurrent programming this semester, I realized during the labs that most of the students couldn't properly debug their code. They are at the end of a 2-year curriculum, know many different programming languages and frameworks, but when it comes to tracking down a bug in their own code, they often lack the basics. Instead of debugging for them I tried to give them general directions that they could apply to the next bugs. I will try here to summarize the very first basic things to know about debugging. Because, remember, writing software is 90% debugging, and 10% introducing new bugs (that is not from me, but I could not find the original quote). So here is my take on Debugging 101.

Use the right tools

Many good tools exist to assist you in writing correct software, and it would put you behind in terms of productivity not to use them. Editors which catch syntax errors while you write them, for example, will help you a lot. And there are many features out there in editors, compilers, debuggers, which will prevent you from introducing trivial bugs. Your editor should be your friend; explore its features and customization options, and find an efficient workflow with them, that you like and can improve over time. The best way to fix bugs is not to have them in the first place, obviously.

Test early, test often

I've seen students writing code for one hour before running make, which would then fail so hard that hundreds of lines of errors and warnings were output. There are two main reasons doing this is a bad idea:

You have to debug all the errors at once, and the complexity of solving many bugs, some dependent on others, is way higher than the complexity of solving a single bug. Moreover, it's discouraging.
Wrong assumptions you made at the beginning will make the following lines of code wrong. For example if you chose the wrong data structure for storing some information, you will have to fix all the code using that structure. It's less painful to realize earlier it was the wrong one to choose, and you have more chances of knowing that if you compile and execute often.

I recommend testing your code (compilation and execution) every few lines of code you write. When something breaks, chances are it will come from the last line(s) you wrote. Compiler errors will be shorter, and will point you to the same place in the code. Once you get more confident using a particular language or framework, you can write more lines at once without testing. That's a slow process, but it's ok. If you set up the right keybinding for compiling and executing from within your editor, it shouldn't be painful to test early and often.

Read the logs

Spot the places where your program/compiler/debugger writes text, and read it carefully. It can be your terminal (quite often), a file in your current directory, a file in /var/log/, a web page on a local server, anything. Learn where different programs write logs on your system, and integrate reading them in your workflow. Often, it will be your only information about the bug. Often, it will tell you where the bug lies. Sometimes, it will even give you hints on how to fix it. You may have to filter out a lot of garbage to find relevant information about your bug. Learn to spot some keywords like error or warning. In long stacktraces, spot the lines concerning your files; because more often than not, your code is to blame, rather than deeper library code. grep the logs with relevant keywords. If you have the option, colorize the output. Use tail -f to follow a file getting updated. There are so many ways to grasp logs, so find what works best for you and never forget to use it!

Print foobar

That one doesn't concern compilation errors (unless it's a Makefile error, in that case this file is your code anyway). When the program logs and output fail to tell you where an error occurred (oh hi Segmentation fault!), and before having to dive into a memory debugger or system trace tool, spot the portion of your program that causes the bug and add some print statements in there. You can either print("foo") and print("bar"), just to know whether or not your program reaches a certain place in your code, or print(some_faulty_var) to get more insights on your program state. It will give you precious information.

#include <iostream>  // std::cerr, std::endl

std::cerr << "foo" << std::endl;
my_db.connect(); // is this broken?
std::cerr << "bar" << std::endl;

In the example above, you can be sure it is the connection to the database my_db that is broken if you get foo and not bar on your standard error. (That is a hypothetical example. If you know something can break, such as a database connection, then you should always enclose it in a try/catch structure).

Isolate and reproduce the bug

This point is linked to the previous one. You may or may not have isolated the line(s) causing the bug, but maybe the issue is not always raised. It can depend on many other things: the program or function parameters, the network status, the amount of memory available, the decisions of the OS scheduler, the user rights on the system or on some files, etc. More generally, any assumption you made on any external dependency can turn out to be wrong (even if it's right 99% of the time). Depending on the context, try to isolate the set of conditions that trigger the bug. It can be as simple as "when there is no internet connection", or as complicated as "when the CPU load of some external machine is too high, it's a leap year, and the input contains illegal utf-8 characters" (ok, that one is fucked up; but it surely happens!). But you need to be able to reliably reproduce the bug, in order to be sure later that you indeed fixed it. Of course when the bug is triggered at every run, it can be frustrating that your program never works but it will in general be easier to fix.

RTFM

Always read the documentation before reaching out for help. Be it man, a book, a website or a wiki, you will find precious information there to assist you in using a language or a specific library. It can be quite intimidating at first, but it's often organized the same way. You're likely to find a search tool, an API reference, a tutorial, and many examples. Compare your code against them. Check in the FAQ, maybe your bug and its solution are already referenced there. You'll rapidly find yourself getting used to the way documentation is organized, and you'll be more and more efficient at finding instantly what you need. Always keep the doc window open!

Google and Stack Overflow are your friends

Let's be honest: many of the bugs you'll encounter have been encountered before. Learn to write efficient queries on search engines, and use the knowledge you can find on questions&answers forums like Stack Overflow. Read the answers and comments. Be wise though, and never blindly copy and paste code from there. It can be as bad as introducing malicious security issues into your code, and you won't learn anything. Oh, and don't copy and paste anyway. You have to be sure you understand every single line, so better write them by hand; it's also better for memorizing the issue.

Take notes

Once you have identified and solved a particular bug, I advise writing about it. No need for shiny interfaces: keep a list of your bugs along with their solutions in one or many text files, organized by language or framework, that you can easily grep. It can seem slightly cumbersome to do so, but it proved (at least to me) to be very valuable. I can often recall I have encountered some buggy situation in the past, but don't always remember the solution. Instead of losing all the debugging time again, I search in my bug/solution list first, and when it's a hit I'm more than happy I kept it.

Further ~~reading~~ debugging

Remember this was only Debugging 101, that is, the very first steps on how to debug code on your own, instead of getting frustrated and staring helplessly at your screen without knowing where to begin. As you write more software, you'll get used to more efficient workflows, and you'll discover tools that are here to assist you in writing bug-free code and spotting complex bugs efficiently. Listed below are some of the tools or general ideas used to debug more complex software. They belong more to a software engineering course than a Debugging 101 blog post. But it's good to know as soon as possible that these exist, and if you read the manuals there's no reason you can't rock with them!

Loggers. To make the "foobar" debugging more efficient, some libraries are especially designed for the task of logging out information about a running program. They often have way more features than a simple print statement (at the price of being over-engineered for simple programs): severity levels (info, warning, error, fatal, etc), output in rotating files, and many more.
Version control. Following the evolution of a program in time, over multiple versions, contributors and forks, is a hard task. That's where version control plays: it allows you to keep the entire history of your program, and switch to any previous version. This way you can identify more easily when a bug was introduced (and by whom), along with the patch (a set of changes to a code base) that introduced it. Then you know where to apply your fix. Famous version control tools include Git, Subversion, and Mercurial.
Debuggers. Last but not least, it wouldn't make sense to talk about debugging without mentioning debuggers. They are tools to inspect the state of a program (for example the type and value of variables) while it is running. You can pause the program, and execute it line by line, while watching the state evolve. Sometimes you can also manually change the value of variables to see what happens. Even though some of them are hard to use, they are very valuable tools, totally worth diving into!
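To give an idea of what the loggers item looks like in practice, here is a small sketch using Python's standard logging module, with severity levels and a rotating file. The logger name, file name, size limits and the simulated connection failure are all assumptions made up for this example.

```python
import logging
import logging.handlers
import os
import tempfile


def make_logger(log_path: str) -> logging.Logger:
    """Build a logger writing to a rotating file (sizes are arbitrary)."""
    logger = logging.getLogger("debugging101")
    logger.setLevel(logging.DEBUG)
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=100_000, backupCount=3
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "app.log")
logger = make_logger(log_path)

logger.info("foo: about to connect")
try:
    # Stand-in for a real call such as my_db.connect() failing.
    raise ConnectionError("database unreachable")
except ConnectionError:
    # Logged at ERROR level, with the traceback appended.
    logger.exception("connection failed")
logger.info("bar: continuing")
```

Compared to bare print statements, each line carries a timestamp and a severity, so the log file can later be filtered with grep for ERROR or WARNING.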

Don't hesitate to comment on this, and provide your debugging 101 tips! I'll be happy to update the article with valuable feedback. Happy debugging!

3 September 2016

Bits from Debian: New Debian Developers and Maintainers (July and August 2016)

The following contributors got their Debian Developer accounts in the last two months:

Edward John Betts (edward)
Holger Wansing (holgerw)
Timothy Martin Potter (tpot)
Martijn van Brummelen (mvb)
Stéphane Blondon (sblondon)
Bertrand Marc (bmarc)
Jochen Sprickerhof (jspricke)
Ben Finney (bignose)
Breno Leitao (leitao)
Zlatan Todoric (zlatan)
Ferenc Wágner (wferi)
Matthieu Caneill (matthieucan)
Steven Chamberlain (stevenc)

The following contributors were added as Debian Maintainers in the last two months:

Jonathan Cristopher Carter
Reiner Herrmann
Michael Jeanson
Jens Reyer
Jerome Benoit
Frédéric Bonnard
Olek Wojnar

Congratulations!

19 July 2016

Michael Prokop: DebConf16 in Capetown/South Africa: Lessons learnt

DebConf 16 in Capetown/South Africa was fantastic for many reasons. My Capetown/South Africa/Culture/Flight related lessons:

Avoid flying on Sundays (especially in/from Austria where plenty of hotlines are closed on Sundays or at least not open when you need them)
Actually turn back your seat on the flight when trying to sleep and not forget that this option exists *cough*
While UCT claims to take energy saving quite seriously (e.g. turn off the lights mentioned at many places around the campus), several toilets flush all their water, even when trying to do just small business, and also two big lights in front of a main building seem to be shining all day long for no apparent reason
There doesn't seem to be a standard for the side of hot vs. cold water-taps
Soap pieces and towels on several toilets
For pedestrians there's just a very short time of green at the traffic lights (~2-3 seconds), then red blinking lights show that you can continue walking across the street (but *should* not start walking) until it's fully red again (but not many people seem to care about the rules anyway :))
Warning lights of cars are used for saying thanks (compared to hand waving in e.g. Austria)
The 40km/h speed limit signs on the roads seem to be showing the recommended minimum speed :-)
There are many speed bumps on the roads
Geese quacking past 11:00 p.m. close to a sleeping room are something I'm also not used to :-)
Announced downtimes for the Internet connection are something I'm not used to
WLAN in the dorms of UCT as well as in any other place I went to at UCT worked excellently (measured ~22-26 Mbs downstream in my room, around 26Mbs in the hacklab) (kudos!)
WLAN is available even on top of the Table Mountain (WLAN working and being free without any registration)
Number26 credit card is great to withdraw money from ATMs without any extra fees from common credit card companies (except for the fee the ATM itself charges but displays ahead on-site anyway)
Splitwise is a nice way to share expenses on the road, especially with its mobile app and the money beaming using the Number26 mobile app

My technical lessons from DebConf16:

ran into way too many yak-shaving situations, some of them might warrant separate blog posts
finally got my hands on gbp-pq (manage quilt patches on patch queue branches in git): very nice to be able to work with plain git and then get patches for your changes, also having upstream patches (like cherry-picks) inside debian/patches/ and the debian specific changes inside debian/patches/debian/ is a lovely idea, this can be easily achieved via Gbp-Pq: Topic debian with gbp's pq and is used e.g. in pkg-systemd, thanks to Michael Biebl for the hint and helping hand
David Bremner's gitpkg/git-debcherry is something to also be aware of (thanks for the reminder, gregoa)
autorevision: extracts revision metadata from your VCS repository (thanks to pabs)
blhc: build log hardening check
Guido's gbp skills exchange session reminded me once again that I should use gbp import-dsc --download $URL_TO_DSC more often
sources.debian.net features specific copyright + patches sections (thanks, Matthieu Caneill)
dpkg-mergechangelogs(1) for 3-way merge of debian/changelog files (thanks, buxy)
meta-git from pkg-perl is always worth a closer look
ifupdown2 (its current version is also available in jessie-backports!) has some nice features, like ifquery --running $interface to get the live configuration of a network interface, json support (ifquery --format=json) and makotemplates support to generate configuration for plenty of interfaces

BTW, thanks to the video team the recordings from the sessions are available online.

8 February 2016

Orestis Ioannou: Debian - your patches and machine readable copyright files are available on Debsources

TL;DR All Debian license and patches are belong to us. Discover them here and here.

In case you hadn't already stumbled upon sources.debian.net in the past, Debsources is a simple web application that allows publishing an unpacked Debian source mirror on the Web. On the live instance you can browse the contents of Debian source packages with syntax highlighting, search files matching a SHA-256 hash or a ctag, query its API, highlight lines, and view accurate statistics and graphs. It was initially developed at IRILL by Stefano Zacchiroli and Matthieu Caneill. During GSOC 2015 I helped introduce two new features.

License Tracker

Since Debsources has all the debian/copyright files, and many of them have adopted the DEP-5 suggestion (machine readable copyright files), it was interesting to exploit them for end users. You may find the following features interesting:

an API that allows users to find the license of file "foo" or the licenses for a bunch of packages, using filenames or SHA-256 hashes
a better looking interface for debian/copyright files

Have a look at the documentation to discover more!

Patch tracker

The old patch tracker unfortunately died a while ago. Since Debsources stores all the patches, it was probably natural for it to exploit them and present them over the web. You can navigate through packages by prefix or by searching them here. Among the use cases:

a summary which contains all the patches of a package together with their diffs and summaries/subjects
links to view and download (quilt-3.0) patches.

15 August 2015

Matthieu Caneill: A one-liner to catch'em all!

I wrote a Bash one-liner to open the source code (in Debsources) of any file on your system (if it belongs to a Debian package). It will simply retrieve the associated package and point your default browser to its source code. Add this somewhere in your $PATH, and name this file debsrc:

#!/bin/bash
function debsrc {
    readlink -f $1 | xargs dpkg-query --search | awk -F ": " '{print $1}' | xargs apt-cache showsrc | grep-dctrl -s 'Package' -n '' | awk -F " " '{print "http://sources.debian.net/src/"$1"/latest/"}' | xargs x-www-browser
}
CMD="$1"
debsrc ${CMD}

And try something like debsrc /usr/share/doc/acpi/AUTHORS. Enjoy! Update: improved the one-liner thanks to josch's advice.
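
For readers who prefer to see the steps spelled out, here is a rough Python sketch of the same pipeline. It is not part of the original one-liner: the function names are made up for illustration, and unlike the shell version it only opens the first source package reported by apt-cache showsrc.

```python
import subprocess
import webbrowser
from pathlib import Path


def binary_package(dpkg_search_output):
    """Extract the package name from `dpkg-query --search` output.

    A line looks like "acpi: /usr/share/doc/acpi/AUTHORS".
    """
    first_line = dpkg_search_output.splitlines()[0]
    return first_line.split(": ", 1)[0]


def source_package(showsrc_output):
    """Return the first "Package:" field of `apt-cache showsrc` output."""
    for line in showsrc_output.splitlines():
        if line.startswith("Package:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no Package field found")


def debsources_url(source):
    """Build the same URL as the one-liner."""
    return f"http://sources.debian.net/src/{source}/latest/"


def debsrc(path):
    """Open the Debsources page of the package owning `path` (Debian only)."""
    real = str(Path(path).resolve())  # same role as `readlink -f`
    owner = subprocess.run(
        ["dpkg-query", "--search", real],
        capture_output=True, text=True, check=True,
    ).stdout
    showsrc = subprocess.run(
        ["apt-cache", "showsrc", binary_package(owner)],
        capture_output=True, text=True, check=True,
    ).stdout
    webbrowser.open(debsources_url(source_package(showsrc)))
```

On a Debian system, calling debsrc("/usr/share/doc/acpi/AUTHORS") should behave like the shell command; the three small helper functions can be tried anywhere since they only handle strings.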

6 May 2015

Matthieu Caneill: Debsources got swag and continuous integration

Debsources (http://sources.debian.net) is still under active development. We recently had a Gnome Outreachy intern, Jingjie Jiang, and we're about to work with 2 GSoC students, Clément Schreiner and Orestis Ioannou. I will present here the GitHub mirror we've set up, in order to allow external pull requests to be submitted, and to use the continuous integration service provided by Travis-CI.

GitHub and Travis-CI

Debsources' source code is hosted on Debian's git servers, and from there is mirrored to GitHub. Every time a commit is pushed (to master or other branches) or a pull request is opened, the test suite is automatically run on Travis-CI, and the result (tests pass or don't) is displayed on GitHub. This allows us to quickly filter external contributions (when they are submitted on GitHub), and be sure everything works with our setup, before reviewing work. Travis-CI runs the tests on OpenVZ containers. The complete infrastructure was a bit challenging to set up, but as we now have a Docker recipe to quickly begin to hack on Debsources, most of the work could be done using the Dockerfile instructions. On average, a run on Travis-CI (which includes git cloning the code and test data, setting up the server, and running the test suite) takes 7 minutes, which is an ok amount of time to wait before submitting a pull request, in my opinion.

Bugs discovered in the process

Setting up this continuous integration infrastructure made me discover a few bugs.

Python magic does black magic

Debsources runs fine on Debian (not surprisingly), but I got tricked by black magic when I tried to run it on Ubuntu (which is the OS run in Travis-CI's containers). We use the magic library to guess the type of files we're dealing with, for instance when we need to decide between rendering a file (for text files) or downloading it (for binary files). Here comes the tricky part: the Python bindings for libmagic are not the same in Debian and PyPI. Debsources uses the Debian package python-magic, which is not in Ubuntu 12.04. Moreover, there's no Python egg for it on PyPI, which does however have another package (called magic) which provides a different API. I solved this with a dirty hack, using the fact that python-magic lies in a single file:

mkdir /tmp/python-magic && wget https://raw.githubusercontent.com/file/file/master/python/magic.py -O /tmp/python-magic/magic.py && export PYTHONPATH=/tmp/python-magic/:$PYTHONPATH

It simply downloads the library, saves it in a temporary folder and includes it in the Python path. Let's see for how long it works before everything breaks!

Size of a directory

One test in the suite was ensuring the information returned by ls -l on a directory and stored in the DB was the right information. Inode metadata was tested, such as name, permissions, type, or size. Interestingly enough, the size of a directory was tested, and expected to be 4096 bytes. The size of a directory actually depends on the filesystem in use, and on the number of files this directory contains. We often see 4096 because it's the size of a not-too-big directory on ext4. Travis-CI doesn't use ext4:

$ df -T
Filesystem            Type     1K-blocks      Used Available Use% Mounted on
/vz/private/209140041 simfs    125829120 103460612  22368508  83% /
none                  devtmpfs   1572864         8   1572856   1% /dev
none                  tmpfs       314576        56    314520   1% /run
none                  tmpfs         5120         4      5116   1% /run/lock
none                  tmpfs      1572864         0   1572864   0% /run/shm
/dev/null             tmpfs       786432    171584    614848  22% /var/ramfs

Simfs is a container filesystem for OpenVZ, on which directories have different sizes than on ext4:

$ ls -al /
total 0
drwxr-xr-x 23 root     root      480 Feb  4 18:08 .
drwxr-xr-x 23 root     root      480 Feb  4 18:08 ..
drwxr-xr-x  2 root     root     2480 Feb  4 18:20 bin
drwxr-xr-x  2 root     root       40 Apr 19  2012 boot
drwxr-xr-x  5 root     root      660 Apr 30 13:56 dev
drwxr-xr-x 99 root     root     3560 Apr 30 13:56 etc
-rw-r--r--  1 root     root        0 Feb  4 17:56 fastboot
drwxr-xr-x  3 root     root       80 Feb  4 17:57 home
[...]
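
To illustrate why a hard-coded 4096 is fragile, here is a minimal Python sketch of a metadata check that compares name, type and permissions but ignores the size of directories. This is not the actual Debsources test; inode_metadata and the dictionary layout are assumptions made for the example.

```python
import os
import stat
import tempfile


def inode_metadata(path):
    """Return comparable metadata for `path`, ignoring directory sizes."""
    st = os.lstat(path)
    mode = stat.filemode(st.st_mode)  # e.g. "drwxr-xr-x" or "-rw-r--r--"
    is_dir = stat.S_ISDIR(st.st_mode)
    return {
        "name": os.path.basename(path),
        "type": mode[0],
        "perms": mode[1:],
        # A directory's size depends on the filesystem and on its content
        # (4096 on a small ext4 directory, other values elsewhere), so it
        # is deliberately left out of the comparison.
        "size": None if is_dir else st.st_size,
    }


with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "hello.txt")
    with open(file_path, "w") as fh:
        fh.write("hello")
    dir_meta = inode_metadata(tmp)
    file_meta = inode_metadata(file_path)
    print(dir_meta["size"], file_meta["size"])
```

Regular files keep their size in the comparison, since that value is the length of the content and does not vary between filesystems.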

Directory sizes are not even powers of 2. Hence I changed the test to not check directory sizes. Hopefully this will help to make Debsources work on more filesystems!

An empty file is hiding

Last but not least, because this bug is still open in the wild: a file, which appears to be empty, is not taken into account by Debsources' updater. This file is sources/non-free/m/make-doc-non-dfsg/4.0-2/.pc/applied-patches. It is present in the filesystem in the container and is not the only empty file over there, but it still doesn't appear in the database, which makes the test that counts files fail. The test has been commented out (booooooh), so that we can still use Travis-CI's platform for our GSoC students before it's fixed.

Conclusion

Making Debsources run automatically on a different platform than the one we usually use allowed us to spot bugs, write dirty hacks, and expand the set of filesystems it's supposed to run on. Now, let's hope the continuous integration will help our GSoC students, and let's wish them good luck!

6 June 2014

Stefano Zacchiroli: debsources paper at ESEM2014

Debsources: Live and Historical Views on Macro-Level Software Evolution

The paper entitled Debsources: Live and Historical Views on Macro-Level Software Evolution, which I've co-authored with Matthieu Caneill, has been accepted at ESEM 2014: the 8th International Symposium on Empirical Software Engineering and Measurement. In the paper we have described Debsources as a software platform for monitoring the evolution of Free Software through the lenses of Debian, and used the main Debsources instance (http://sources.debian.net) to replicate and extend a former study on macro-level software evolution. Now we "just" have to integrate all the nice charts and data we have extracted for the paper into Debsources' stats page... /o\

27 February 2014

Stefano Zacchiroli: moar stats for sources.debian.net

Debian: watch your stats!

Over the past few weeks, Matthieu Caneill and I have worked quite a bit on Debsources. As we have now deployed most of the new features on http://sources.debian.net, it's time for another "What's new with Debsources?" blog post. Here is what's new:

Debsources now knows about Debian suites, i.e. which package is in which "release" (stable, testing, unstable, ...). This knowledge is already useful for some of the other features below and will be used more in the future.
since last summer Debsources has been running sloccount on all unpacked source packages, together with ctags and du, but the resulting information wasn't exposed on the Web. This is now fixed. Each package now has an infobox (example) which shows: disk usage, archive area, suites, and sloccount with per-language breakdown. The new infobox also subsumes the old puny list of package links. You can easily embed the infobox in other webapps if you need to (example). Check the URL scheme doc for more info.
Debsources now gathers and plots accurate Debian sources statistics, both overall and per-suite, in both snapshot and historical trends flavors. (Yeah, I know, the charts are not particularly good looking ATM, but that's easy to change without impacting the rest. So if you're a matplotlib artist and willing to help, please step forward!)
many changes have been going on also at the plumbing layer to make the service less resource hungry and more maintainable, in view of a migration to the official Debian infrastructure --- which I've in the meantime started discussing with DSA. Some highlights:
- Debsources now has a rather comprehensive test suite, built using Nose. Most notably, we do test full update runs down to source unpacking (of a small subset of a Debian mirror), DB injection, and plugin execution --- which is quite neat.
- the updater is now much faster (about 2x) and might require, in pathological cases, 10x less memory than before. Memory usage now caps at around 300MB, even when injecting ctags for large packages such as linux, chromium, and libreoffice.
- the DB schema went through several refactoring cycles, and now uses a separate file table to index all known source file paths. In the past, path information was duplicated across the checksums and ctags tables, not only wasting DB space, but also making the presence of file information conditional on the enablement of at least one of the two corresponding plugins. This is now fixed --- and migrating the full DB has been quite "fun". Unfortunately, we've also added quite a few large-ish indexes, resulting in no significant overall changes in DB size (currently at ~50GB), but at least in much faster queries. The next step on this front will be the addition of path-based searches, using the excellent Postgres trigram indexes.
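
For readers unfamiliar with trigram matching, the idea behind such indexes can be sketched in a few lines of Python. This is only an illustration of the principle, not Postgres' implementation nor Debsources code: the function names are made up, and the whole string is treated as a single word, whereas pg_trgm first splits on non-alphanumeric characters.

```python
def trigrams(text):
    """Return the set of 3-character substrings of `text`.

    The string is lowercased and padded with two leading spaces and one
    trailing space, which is the padding pg_trgm applies to each word.
    """
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Share of trigrams the two strings have in common (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(sorted(trigrams("cat")))
    print(similarity("makefile", "Makefile.in"))
```

An index over these small fragments lets the database shortlist candidate paths that share trigrams with the query instead of scanning every row.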

Want more? Sure, we'll be happy to! But it'll happen faster if you help. Speaking of which: we've got Debsources into the new contributors game (see announcement) and we're looking forward to mentoring new contributors.

17 September 2013

Stefano Zacchiroli: sources.debian.net - advanced search and other news

all your ctag (and checksum) are belong to us

A few months after the initial announcement, here is some news about the sources.d.n service. I've been late in blogging this, but most of it was implemented by Matthieu Caneill and me during DebConf13, which has been a great DebConf, totally exceeding my expectations (and they were already fairly high!). First, you might have noticed some user-visible changes:

there is now an advanced search page, which complements the already existing regex code search with the possibility of searching source files by their sha256, or the ctags defined therein
on the same topic, when browsing through a package and using regex search, you'll now search by default within that package, allowing you to focus your searches more easily than before. (You can easily override this by editing the search box and removing the package: predicate.)
for the data geeks (or the wannabe host), there are now disk usage stats (note that they don't include the database size, though, see below for that)
the website also got a significant facelift, as part of which we have moved the detailed explanations of what the service is about out of your way. You now immediately get to the various browsing options.
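
To try the sha256 search with a file you have locally, you first need its checksum. A minimal Python sketch, equivalent in result to the sha256sum command (the function name is just an illustration):

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so that large source files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting 64-character hexadecimal string is what the advanced search page expects in its checksum field.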

On the other hand, under the hood:

to implement ctags and sha256 searches we needed a serious DBMS, so we switched from SQLite to PostgreSQL. Again, for the data geek: storing ctags/sha256 for all of sources.d.n content with decent indexes takes about 37 GB, for about 160 million rows in the ctags table and 20 million rows in the checksums one. (Currently filenames are duplicated between the two tables so, probably, the DB disk size might be reduced some.)
together with the switch to a serious DBMS, the update logic has been completely rewritten in Python (from Bash...), and should now be entirely transactional.
... and given it was going to be Python anyhow, better to enjoy what it has to offer, no? So there is now a plugin mechanism that makes it easier to add extra data extractors, triggering them at each package update. Currently there are plugins for sha256sum, ctags, and sloccount (even though the latter is not yet exposed via the web interface). An added benefit of this is that if you want to deploy debsources elsewhere, you can easily disable the most time consuming extractors: running ctags and sha256sum on the fabulous 3 chromium/libreoffice/linux is not for the faint of disks...
we now receive push updates from the Debian mirror network, so that you'll get updates on sources.d.n as soon as a package hits Debian mirrors (+ processing time, which is about 15-20 minutes on the average update run). Many thanks to Simon Paillard and Adam Lackorzynski for their help in setting this up.
thanks to a suggestion by kugel we have adopted Geany's conventions for filetype detection, and we now take into account both file extensions and shebang lines (when available)
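
The filetype detection mentioned in the last item can be pictured with a small Python sketch. It is a simplified illustration rather than the Debsources code: the lookup tables are tiny, hypothetical stand-ins for Geany's much larger ones, and checking the extension before the shebang is an assumption.

```python
import os
import re

# Hypothetical, deliberately tiny lookup tables.
EXTENSIONS = {".py": "python", ".sh": "sh", ".c": "c", ".pl": "perl"}
INTERPRETERS = {
    "python": "python",
    "python3": "python",
    "sh": "sh",
    "bash": "sh",
    "perl": "perl",
}

# Captures the interpreter path and, optionally, its first argument.
SHEBANG = re.compile(r"^#!\s*(\S+)(?:\s+(\S+))?")


def detect_language(filename, first_line=""):
    """Guess a language from the file extension, then from the shebang."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTENSIONS:
        return EXTENSIONS[ext]
    match = SHEBANG.match(first_line)
    if not match:
        return None
    program, argument = match.groups()
    name = os.path.basename(program)
    if name == "env" and argument:
        # "#!/usr/bin/env python3": the interpreter is the argument.
        name = argument
    return INTERPRETERS.get(name)
```

With these tables, detect_language("debsrc", "#!/bin/bash") falls back to the shebang because the file name has no extension.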

As usual, your bug reports (and patches!) are more than welcome, just check BUGS before reporting to avoid duplicates.
That's all!

2 July 2013

Bits from Debian: all Debian source are belong to us

This is a verbatim repost from Stefano Zacchiroli's post.

TL;DR: go to http://sources.debian.net and enjoy.

Debsources is a new toy I've been working on at IRILL together with Matthieu Caneill. In essence, debsources is a simple web application that allows publishing an unpacked Debian source mirror on the Web. You can deploy Debsources where you please, but there is a main instance at http://sources.debian.net (sources.d.n for short) that you will probably find interesting. sources.d.n follows closely the Debian archive in two ways:

it is updated 4 times a day to reflect the content of the Debian archive
it contains sources coming from official Debian suites: the usual ones (from oldstable to experimental), *-updates (ex volatile), *-proposed-updates, and *-backports (from Wheezy on)

Via sources.d.n you can therefore browse the content of Debian source packages with usual code viewing features like syntax highlighting. More interestingly, you can search through the source code (of unstable only, though) via integration with http://codesearch.debian.net. You can also use sources.d.n programmatically to query available versions or link to specific lines, with the possibility of adding contextual pop-up messages (example).

In fact, you might have stumbled upon sources.d.n already in the past few days, via other popular Debian services where it has already been integrated. In particular: codesearch.d.n now defaults to show results via sources.d.n, and the PTS has grown new "browse source code" hyperlinks that point to it. If you have ideas of other Debian services where sources.d.n should be integrated, please let me know.

I find Debsources and sources.d.n already quite useful but, as it often happens, there is still a lot to do. Obviously, it is all Free Software (released under GNU AGPLv3). Do not hesitate to report new bugs and, better, to submit patches for the outstanding ones.

Acknowledgements

Matthieu Caneill is the main developer of Debsources web front-end; sources.d.n wouldn't exist without him.
others have already contributed patches to integrate sources.d.n with other services, in particular:
- many thanks to Michael Stapelberg (for codesearch.d.n integration), and
- Paul Wise (for PTS integration).
a full list of contributors is available and eagerly waiting for new additions
IRILL has kindly provided sponsoring for Matthieu's initial development work on Debsources, and offered both the server and hosting facilities that power sources.d.n

PS in case you were wondering: at present sources.d.n requires ~381 GB of disk space to hold all uncompressed source packages, plus ~83 GB for the local (compressed) source mirror

Stefano Zacchiroli: introducing sources.debian.net

all Debian source are belong to us

TL;DR: go to http://sources.debian.net and enjoy.
