Search Results: "Matthieu Caneill"

23 January 2022

Matthieu Caneill: Debsources, python3, and funky file names

Rumors are running that python2 is not a thing anymore. Well, I'm certainly late to the party, but I'm happy to report that sources.debian.org is now running python3.

Wait, it wasn't?

Back when development started, python3 was very much a real language, but it was hard to adopt because it was not supported by many libraries. So python2 was chosen, meaning print-based debugging was used in lieu of print()-based debugging, and str were bytes, not unicode. And things were working just fine.

One day python2 EOL was announced, with a date far in the future. Far enough to procrastinate for a long time. Combine this with a codebase that is stable enough to not see many commits, and the fact that Debsources is a volunteer-based project that happens at best on week-ends, and you end up with dormant software and a missed deadline.

But, as dormant as the codebase is, the instance hosted at sources.debian.org is very popular and gets 200k to 500k hits per day. More than enough to be worth proper maintenance and a transition to python3.

Funky file names

While transitioning to python3 and juggling left and right with str, bytes and unicode for internal objects, files, database entries and HTTP content, I stumbled upon a bug that has been there since day 1.

Quick recap if you're unfamiliar with this tool: Debsources displays the content of the source packages in the Debian archive. In other words, it's a bit like GitHub, but for the Debian source code.

And some pieces of software out there, that ended up in Debian packages, happen to contain files whose names can't be decoded to UTF-8. Interestingly enough, there's no such thing as a standard for file names: with a few exceptions that vary by operating system, any sequence of bytes can be a legit file name. And some sequences of bytes are not valid UTF-8.

Of course those files are rare, and using ASCII characters to name a file is a much more common practice than using bytes in a non-UTF-8 character encoding.
But when you deal with almost 100 million files on which you have no control (those files come from free software projects, and make their way into Debian without any renaming), it happens. Now back to the bug: when trying to display such a file through the web interface, it would crash because it can't convert the file name to UTF-8, which is needed for the HTML representation of the page. Bugfix An often valid approach when trying to represent invalid UTF-8 content is to ignore errors, and replace them with ? or . This is what Debsources actually does to display non-UTF-8 file content. Unfortunately, this best-effort approach is not suitable for file names, as file names are also identifiers in Debsources: among other places, they are part of URLs. If an URL were to use placeholder characters to replace those bytes, there would be no deterministic way to match it with a file on disk anymore. The representation of binary data into text is a known problem. Multiple lossless solutions exist, such as base64 and its variants, but URLs looking like https://sources.debian.org/src/Y293c2F5LzMuMDMtOS4yL2Nvd3NheS8= are not readable at all compared to https://sources.debian.org/src/cowsay/3.03-9.2/cowsay/. Plus, not backwards-compatible with all existing links. The solution I chose is to use double-percent encoding: this allows the representation of any byte in an URL, while keeping allowed characters unchanged - and preventing CGI gateways from trying to decode non-UTF-8 bytes. This is the best of both worlds: regular file names get to appear normally and are human-readable, and funky file names only have percent signs and hex numbers where needed. Here is an example of such an URL: https://sources.debian.org/src/aspell-is/0.51-0-4/%25EDslenska.alias/. Notice the %25ED to represent the percentage symbol itself (%25) followed by an invalid UTF-8 byte (%ED). 
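To make the idea of double-percent encoding concrete, here is a minimal Python sketch. It is not the actual Debsources code: encode_path and decode_path are hypothetical helper names, and the sketch simply assumes that applying the standard library's percent-encoding twice is close enough to illustrate the scheme.

```python
from urllib.parse import quote, unquote_to_bytes


def encode_path(raw: bytes) -> str:
    """Percent-encode a raw path twice, leaving '/' and unreserved characters alone."""
    once = quote(raw, safe="/")   # first pass: the byte 0xED becomes '%ED'
    return quote(once, safe="/")  # second pass: '%ED' becomes '%25ED'


def decode_path(encoded: str) -> bytes:
    """Undo both layers of encoding and return the original bytes (hypothetical helper)."""
    once = unquote_to_bytes(encoded)  # '%25ED' becomes b'%ED'
    return unquote_to_bytes(once)     # b'%ED' becomes the byte 0xED


funky = b"aspell-is/0.51-0-4/\xedslenska.alias"
regular = b"cowsay/3.03-9.2/cowsay"

print(encode_path(funky))    # aspell-is/0.51-0-4/%25EDslenska.alias
print(encode_path(regular))  # cowsay/3.03-9.2/cowsay

# The encoding is lossless: the exact bytes on disk can be recovered.
assert decode_path(encode_path(funky)) == funky

# By contrast, the best-effort approach loses the original byte for good.
print(funky.decode("utf-8", errors="replace"))
```

One limitation of this sketch: names containing reserved characters such as a space or a literal percent sign are also encoded twice, so only names made of unreserved characters come out unchanged.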
Transitioning to this was quite a challenge, as those file names don't only appear in URLs, but also in web pages themselves, log files, database tables, etc. And everything was done with str: which made sense in python2 when str were bytes, but not much in python3.

What are those files? What's their network?

I was wondering too. Let's list them!
import os
with open('non-utf-8-paths.bin', 'wb') as f:
    for root, folders, files in os.walk(b'/srv/sources.debian.org/sources/'):
        for path in folders + files:
            try:
                path.decode('utf-8')
            except UnicodeDecodeError:
                f.write(root + b'/' + path + b'\n')
Running this on the Debsources main instance, which hosts pretty much all Debian packages that were part of a Debian release, I could find 307 files (among a total of almost 100 million files). Without looking deep into them, they seem to fall into 2 categories: That last point hits home, as it was clearly lacking in Debsources. A funky file name is now part of its test suite. ;)

18 November 2017

Matthieu Caneill: MiniDebconf in Toulouse

I attended the MiniDebconf in Toulouse, which was hosted in the larger Capitole du Libre, a free software event with talks, presentation of associations, and a keysigning party. I didn't expect the event to be that big, and I was very impressed by its organization. Cheers to all the volunteers, it has been an amazing week-end! Here's a summary of the talks I attended.

Du logiciel libre à la monnaie libre

Speaker: Éloïs

The first talk I attended was, translated to English, "from free software to free money". Éloïs compared the 4 freedoms of free software with money, and what properties money needs to exhibit in order to be considered free. He then introduced Ğ1, a project of free (as in free speech!) money, started in the region around Toulouse. Contrary to some distributed ledgers such as Bitcoin, Ğ1 isn't based on a hash-based proof-of-work, but rather on a web of trust of people certifying each other, hence limiting the energy consumption required by the network to function.

YunoHost

Speaker: Jimmy Monin

I then attended a presentation of YunoHost. Being a happy user myself, it was very nice to discover the future expected features, and also meet two of the developers. YunoHost is a Debian-based project, aimed at providing all the tools necessary to self-host applications, including email, website, calendar, development tools, and dozens of other packages.

Premiers pas dans l'univers de Debian

Speaker: Nicolas Dandrimont

For the first talk of the MiniDebConf, Nicolas Dandrimont introduced Debian, its philosophy, and how it works with regard to upstreams and downstreams. He gave many details on the teams, the infrastructure, and the internals of Debian.

Trusting your computer and system

Speaker: Jonas Smedegaard

Jonas introduced some security concepts, and how they are abused and often meaningless (to quote his own words, "secure is bullshit"). He described a few projects which lean towards more secure and open hardware, for both phones and laptops.
Automatiser la gestion de configuration de Debian avec Ansible

Speaker: Jérémy Lecour

Jérémy, from Evolix, introduced Ansible, and how they use it to manage hundreds of Debian servers. Ansible is a very powerful tool, and a huge ecosystem, in many ways similar to Puppet or Chef, except it is agent-less, using only ssh connections to communicate with remote machines. It was very nice to compare their use of Ansible with mine, since that's the software I use at work for deploying experiments.

Making Debian for everybody

Speaker: Samuel Thibault

Samuel gave a talk about accessibility, and the general availability of the tools in today's operating systems, including Debian. The lesson to take home is that we often don't do enough in this domain, particularly when considering some issues people might have that we don't always think about. Accessibility on computers (and elsewhere) should be the default, and never require complex setups.

Retour d'expérience : mise à jour de milliers de terminaux Debian

Speaker: Cyril Brulebois

Cyril described a problem he was hired for, an update of thousands of Debian servers from wheezy to jessie, which he discovered afterwards was worse than initially thought, since the machines were running the out-of-date squeeze. Since they were not always administered with the best sysadmin practices, they were all exhibiting different configurations and different package lists, which raised many issues and gave him interesting challenges. They were solved using Ansible, which also had the effect of standardizing their system administration practices.

Retour d'expérience : utilisation de Debian chez Evolix

Speaker: Grégory Colpart

Grégory described Evolix, a company which manages servers for their clients, and how they were inspired by Debian, for both their internal tools and their practices. It is very interesting to see that some of the Debian values can be easily exported for a more open and collaborative business.
Lightning talks

To close the conference, two lightning talks were presented, describing the switch from Windows XP to Debian in an environmental association near Toulouse; and how snapshot.debian.org can be used with bisections to find the source of some regressions.

Conclusion

A big thank you to all the organizers and the associations who contributed to make this event a success. Cheers!

23 August 2017

Antoine Beaupré: The supposed decline of copyleft

At DebConf17, John Sullivan, the executive director of the FSF, gave a talk on the supposed decline of the use of copyleft licenses in free-software projects. In his presentation, Sullivan questioned the notion that permissive licenses, like the BSD or MIT licenses, are gaining ground at the expense of the traditionally dominant copyleft licenses from the FSF. While there does seem to be a rise in the use of permissive licenses, in general, there are several possible explanations for the phenomenon.

When the rumor mill starts

Sullivan gave a recent example of the claim of the decline of copyleft in an article on Opensource.com by Jono Bacon from February 2017 that showed a histogram of license usage between 2010 and 2017 (seen below).
[Black Duck   histogram]
From that, Bacon elaborates possible reasons for the apparent decline of the GPL. The graphic used in the article was actually generated by Stephen O'Grady in a January article, The State Of Open Source Licensing, which said:
In Black Duck's sample, the most popular variant of the GPL version 2 is less than half as popular as it was (46% to 19%). Over the same span, the permissive MIT has gone from 8% share to 29%, while its permissive cousin the Apache License 2.0 jumped from 5% to 15%.
Sullivan, however, argued that the methodology used to create both articles was problematic. Neither contains original research: the graphs actually come from the Black Duck Software "KnowledgeBase" data, which was partly created from the old Ohloh web site now known as Open Hub. To show one problem with the data, Sullivan mentioned two free-software projects, GNU Bash and GNU Emacs, that had been showcased on the front page of Ohloh.net in 2012. On the site, Bash was (and still is) listed as GPLv2+, whereas it changed to GPLv3 in 2011. He also claimed that "Emacs was listed as licensed under GPLv3-only, which is a license Emacs has never had in its history", although I wasn't able to verify that information from the Internet archive. Basically, according to Sullivan, "the two projects featured on the front page of a site that was using [the Black Duck] data set were wrong". This, in turn, seriously brings into question the quality of the data:
I reported this problem and we'll continue to do that but when someone is not sharing the data set that they're using for other people to evaluate it and we see glimpses of it which are incorrect, that should give us a lot of hesitation about accepting any conclusion that comes out of it.
Reproducible observations are necessary to the establishment of solid theories in science. Sullivan didn't try to contact Black Duck to get access to the database, because he assumed (rightly, as it turned out) that he would need to "pay for the data under terms that forbid you to share that information with anybody else". So I wrote Black Duck myself to confirm this information. In an email interview, Patrick Carey from Black Duck confirmed its data set is proprietary. He believes, however, that through a "combination of human and automated techniques", Black Duck is "highly confident at the accuracy and completeness of the data in the KnowledgeBase". He did point out, however, that "the way we track the data may not necessarily be optimal for answering the question on license use trend" as "that would entail examination of new open source projects coming into existence each year and the licenses used by them". In other words, even according to Black Duck, its database may not be useful to establish the conclusions drawn by those articles. Carey did agree with those conclusions intuitively, however, saying that "there seems to be a shift toward Apache and MIT licenses in new projects, though I don't have data to back that up". He suggested that "an effective way to answer the trend question would be to analyze the new projects on GitHub over the last 5-10 years." Carey also suggested that "GitHub has become so dominant over the recent years that just looking at projects on GitHub would give you a reasonable sampling from which to draw conclusions".
[GitHub   graph]
Indeed, GitHub published a report in 2015 that also seems to confirm MIT's popularity (45%), surpassing copyleft licenses (24%). The data is, however, not without its own limitations. For example, in the above graph going back to the inception of GitHub in 2008, we see a rather abnormal spike in 2013, which seems to correlate with the launch of the choosealicense.com site, described by GitHub as "our first pass at making open source licensing on GitHub easier". In his talk, Sullivan was critical of the initial version of the site which he described as biased toward permissive licenses. Because the GitHub project creation page links to the site, Sullivan explained that the site's bias could have actually influenced GitHub users' license choices. Following a talk from Sullivan at FOSDEM 2016, GitHub addressed the problem later that year by rewording parts of the front page to be more accurate, but any resulting change in license choice obviously doesn't show in the report produced in 2015, and the rewording won't affect choices users have already made. Therefore, there are reasonable doubts about whether GitHub's subset of software projects is actually representative of the larger free-software community.

In search of solid evidence

So it seems we are missing good, reproducible results to confirm or dispel these claims. Sullivan explained that it is a difficult problem, if only in the way you select which projects to analyze: the impact of an MIT-licensed personal wiki will obviously be vastly different from, say, a GPL-licensed C compiler or kernel. We may want to distinguish between active and inactive projects. Then there is the problem of code duplication, both across publication platforms (a project may be published on GitHub and SourceForge for example) but also across projects (code may be copy-pasted between projects). We should think about how to evaluate the license of a given project: different files in the same code base regularly have different licenses, and often none at all. This is why having a clear, documented and publicly available data set and methodology is critical. Without this, the assumptions made are not clear and it is unreasonable to draw certain conclusions from the results.

It turns out that some researchers did that kind of open research in 2016 in a paper called "The Debsources Dataset: Two Decades of Free and Open Source Software" [PDF] by Matthieu Caneill, Daniel M. Germán, and Stefano Zacchiroli. The Debsources data set is the complete Debian source code that covers a large history of the Debian project and therefore includes thousands of free-software projects of different origins. According to the paper:
The long history of Debian creates a perfect subject to evaluate how FOSS licenses use has evolved over time, and the popularity of licenses currently in use.
Sullivan argued that the Debsources data set is interesting because of its quality: every package in Debian has been reviewed by multiple humans, including the original packager, but also by the FTP masters to ensure that the distribution can legally redistribute the software. The existence of a package in Debian provides a minimal "proof of use": unmaintained packages get removed from Debian on a regular basis and the mere fact that a piece of software gets packaged in Debian means at least some users found it important enough to work on packaging it. Debian packagers make specific efforts to avoid code duplication between packages in order to ease security maintenance. The data set covers a period longer than Black Duck's or GitHub's, as it goes all the way back to the Hamm 2.0 release in 1998. The data and how to reproduce it are freely available under a CC BY-SA 4.0 license.
[Debsource   graph]
Sullivan presented the above graph from the research paper that showed the evolution of software license use in the Debian archive. Whereas previous graphs showed statistics in percentages, this one showed actual absolute numbers, where we can't actually distinguish a decline in copyleft licenses. To quote the paper again:
The top license is, once again, GPL-2.0+, followed by: Artistic-1.0/GPL dual-licensing (the licensing choice of Perl and most Perl libraries), GPL-3.0+, and Apache-2.0.
Indeed, looking at the graph, at most we see a rise of the Apache and MIT licenses and no decline of the GPL per se, although its adoption does seem to slow down in recent years. We should also mention the possibility that Debian's data set has the opposite bias: toward GPL software. The Debian project is culturally quite different from the GitHub community and even the larger free-software ecosystem, naturally, which could explain the disparity in the results. We can only hope a similar analysis can be performed on the much larger Software Heritage data set eventually, which may give more representative results. The paper acknowledges this problem:
Debian is likely representative of enterprise use of FOSS as a base operating system, where stable, long-term and seldomly updated software products are desirable. Conversely Debian is unlikely representative of more dynamic FOSS environments (e.g., modern Web-development with micro libraries) where users, who are usually developers themselves, expect to receive library updates on a daily basis.
The Debsources research also shares methodology limitations with Black Duck: while Debian packages are reviewed before uploading and we can rely on the copyright information provided by Debian maintainers, the research also relies on automated tools (specifically FOSSology) to retrieve license information. Sullivan also warned against "ascribing reason to numbers": people may have different reasons for choosing a particular license. Developers may choose the MIT license because it has fewer words, for compatibility reasons, or simply because "their lawyers told them to". It may not imply an actual deliberate philosophical or ideological choice. Finally, he brought up the theory that the rise of non-copyleft licenses isn't necessarily at the detriment of the GPL. He explained that, even if there is an actual decline, it may not be much of a problem if there is an overall growth of free software to the detriment of proprietary software. He reminded the audience that non-copyleft licenses are still free software, according to the FSF and the Debian Free Software Guidelines, so their rise is still a positive outcome. Even if the GPL is a better tool to accomplish the goal of a free-software world, we can all acknowledge that the conversion of proprietary software to more permissive and certainly simpler licenses is definitely heading in the right direction.
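The difference between percentage shares and absolute numbers can be illustrated with a toy calculation. The counts below are invented purely for the example and come from none of the data sets discussed here; the point is only that a license family can lose share while still growing in absolute terms.

```python
# Invented project counts, for illustration only (not real data).
counts = {
    2010: {"copyleft": 6000, "permissive": 2000},
    2017: {"copyleft": 9000, "permissive": 9000},
}


def shares(year_counts):
    """Return each license family's share of the yearly total, in percent."""
    total = sum(year_counts.values())
    return {name: round(100 * n / total, 1) for name, n in year_counts.items()}


for year in sorted(counts):
    print(year, counts[year], shares(counts[year]))

# Copyleft's share drops from 75% to 50%...
assert shares(counts[2010])["copyleft"] == 75.0
assert shares(counts[2017])["copyleft"] == 50.0
# ...while the absolute number of copyleft projects grows by half.
assert counts[2017]["copyleft"] > counts[2010]["copyleft"]
```

A histogram of shares built from these made-up numbers would suggest a decline, while a graph of absolute counts would not.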
[I would like to thank the DebConf organizers for providing meals for me during the conference.] Note: this article first appeared in the Linux Weekly News.

22 October 2016

Matthieu Caneill: Debugging 101

While teaching a class on concurrent programming this semester, I realized during the labs that most of the students couldn't properly debug their code. They are at the end of a 2-year curriculum, know many different programming languages and frameworks, but when it comes to tracking down a bug in their own code, they often lack the basics. Instead of debugging for them I tried to give them general directions that they could apply to the next bugs. I will try here to summarize the very first basic things to know about debugging. Because, remember, writing software is 90% debugging, and 10% introducing new bugs (that is not from me, but I could not find the original quote). So here is my take on Debugging 101.

Use the right tools

Many good tools exist to assist you in writing correct software, and it would put you behind in terms of productivity not to use them. Editors which catch syntax errors while you write them, for example, will help you a lot. And there are many features out there in editors, compilers, debuggers, which will prevent you from introducing trivial bugs. Your editor should be your friend; explore its features and customization options, and find an efficient workflow with them, that you like and can improve over time. The best way to fix bugs is not to have them in the first place, obviously.

Test early, test often

I've seen students writing code for one hour before running make, which would then fail so hard that hundreds of lines of errors and warnings were printed. There are two main reasons doing this is a bad idea:

I recommend testing your code (compilation and execution) every few lines of code you write. When something breaks, chances are it will come from the last line(s) you wrote. Compiler errors will be shorter, and will point you to the same place in the code. Once you get more confident using a particular language or framework, you can write more lines at once without testing. That's a slow process, but it's ok.
If you set up the right keybinding for compiling and executing from within your editor, it shouldn't be painful to test early and often.

Read the logs

Spot the places where your program/compiler/debugger writes text, and read it carefully. It can be your terminal (quite often), a file in your current directory, a file in /var/log/, a web page on a local server, anything. Learn where different software writes logs on your system, and integrate reading them in your workflow. Often, it will be your only information about the bug. Often, it will tell you where the bug lies. Sometimes, it will even give you hints on how to fix it.

You may have to filter out a lot of garbage to find relevant information about your bug. Learn to spot some keywords like error or warning. In long stacktraces, spot the lines concerning your files; because more often, your code is to be blamed, rather than deeper library code. grep the logs with relevant keywords. If you have the option, colorize the output. Use tail -f to follow a file getting updated. There are so many ways to grasp logs, so find what works best for you and never forget to use it!

Print foobar

That one doesn't concern compilation errors (unless it's a Makefile error, in that case this file is your code anyway). When the program logs and output fail to tell you where an error occurred (oh hi Segmentation fault!), and before having to dive into a memory debugger or system trace tool, spot the portion of your program that causes the bug and add some print statements in there. You can either print("foo") and print("bar"), just to know whether or not your program reaches a certain place in your code, or print(some_faulty_var) to get more insights on your program state. It will give you precious information.
std::cerr << "foo" << std::endl; // printed before the suspicious call
my_db.connect(); // is this broken?
std::cerr << "bar" << std::endl; // only printed if connect() returned
In the example above, you can be sure it is the connection to the database my_db that is broken if you get foo and not bar on your standard error. (That is a hypothetical example. If you know something can break, such as a database connection, then you should always enclose it in a try/catch structure).

Isolate and reproduce the bug

This point is linked to the previous one. You may or may not have isolated the line(s) causing the bug, but maybe the issue is not always raised. It can depend on many other things: the program or function parameters, the network status, the amount of memory available, the decisions of the OS scheduler, the user rights on the system or on some files, etc. More generally, any assumption you made on any external dependency can turn out to be wrong (even if it's right 99% of the time). Depending on the context, try to isolate the set of conditions that trigger the bug. It can be as simple as "when there is no internet connection", or as complicated as "when the CPU load of some external machine is too high, it's a leap year, and the input contains illegal utf-8 characters" (ok, that one is fucked up; but it surely happens!). But you need to be able to reliably reproduce the bug, in order to be sure later that you indeed fixed it. Of course when the bug is triggered at every run, it can be frustrating that your program never works but it will in general be easier to fix.

RTFM

Always read the documentation before reaching out for help. Be it man, a book, a website or a wiki, you will find precious information there to assist you in using a language or a specific library. It can be quite intimidating at first, but it's often organized the same way. You're likely to find a search tool, an API reference, a tutorial, and many examples. Compare your code against them. Check in the FAQ, maybe your bug and its solution are already referenced there.
You'll rapidly find yourself getting used to the way documentation is organized, and you'll be more and more efficient at finding instantly what you need. Always keep the doc window open!

Google and Stack Overflow are your friends

Let's be honest: many of the bugs you'll encounter have been encountered before. Learn to write efficient queries on search engines, and use the knowledge you can find on question-and-answer forums like Stack Overflow. Read the answers and comments. Be wise though, and never blindly copy and paste code from there. It can be as bad as introducing malicious security issues into your code, and you won't learn anything. Oh, and don't copy and paste anyway. You have to be sure you understand every single line, so better write them by hand; it's also better for memorizing the issue.

Take notes

Once you have identified and solved a particular bug, I advise you to write about it. No need for shiny interfaces: keep a list of your bugs along with their solutions in one or many text files, organized by language or framework, that you can easily grep. It can seem slightly cumbersome to do so, but it proved (at least to me) to be very valuable. I can often recall I have encountered some buggy situation in the past, but don't always remember the solution. Instead of losing all the debugging time again, I search in my bug/solution list first, and when it's a hit I'm more than happy I kept it.

Further debugging

Remember this was only Debugging 101, that is, the very first steps on how to debug code on your own, instead of getting frustrated and helplessly staring at your screen without knowing where to begin. As you write more software, you'll get used to more efficient workflows, and you'll discover tools that are here to assist you in writing bug-free code and spotting complex bugs efficiently. Listed below are some of the tools or general ideas used to debug more complex software.
They belong more to a software engineering course than a Debugging 101 blog post. But it's good to know as soon as possible these exist, and if you read the manuals there's no reason you can't rock with them! Don't hesitate to comment on this, and provide your debugging 101 tips! I'll be happy to update the article with valuable feedback. Happy debugging!

3 September 2016

Bits from Debian: New Debian Developers and Maintainers (July and August 2016)

The following contributors got their Debian Developer accounts in the last two months: The following contributors were added as Debian Maintainers in the last two months: Congratulations!

19 July 2016

Michael Prokop: DebConf16 in Capetown/South Africa: Lessons learnt

DebConf 16 in Capetown/South Africa was fantastic for many reasons. My Capetown/South Africa/Culture/Flight related lessons: My technical lessons from DebConf16: BTW, thanks to the video team the recordings from the sessions are available online.

8 February 2016

Orestis Ioannou: Debian - your patches and machine readable copyright files are available on Debsources

TL;DR

All Debian license and patches are belong to us. Discover them here and here.

In case you hadn't already stumbled upon sources.debian.net in the past, Debsources is a simple web application that allows publishing an unpacked Debian source mirror on the Web. On the live instance you can browse the contents of Debian source packages with syntax highlighting, search files matching a SHA-256 hash or a ctag, query its API, highlight lines, view accurate statistics and graphs. It was initially developed at IRILL by Stefano Zacchiroli and Matthieu Caneill. During GSoC 2015 I helped introduce two new features.

License Tracker

Since Debsources has all the debian/copyright files, and many of them have adopted the DEP-5 suggestion (machine readable copyright files), it was interesting to exploit them for end users. You may find the following features interesting:

Have a look at the documentation to discover more!

Patch tracker

The old patch tracker unfortunately died a while ago. Since Debsources stores all the patches, it was probably natural for it to be able to exploit them and present them over the web. You can navigate through packages by prefix or by searching them here. Among the use cases:

Read more about the API!

Coming ...

I hope you find these new features useful. Don't hesitate to report any bugs or suggestions you come across.

15 August 2015

Matthieu Caneill: A one-liner to catch'em all!

I wrote a Bash one-liner to open the source code (in Debsources) of any file on your system (if it belongs to a Debian package). It will simply retrieve the associated package and point your default browser to its source code. Add this somewhere in your $PATH, and name this file debsrc:
#!/bin/bash
function debsrc {
    readlink -f "$1" | xargs dpkg-query --search | awk -F ": " '{print $1}' | xargs apt-cache showsrc | grep-dctrl -s 'Package' -n '' | awk -F " " '{print "http://sources.debian.net/src/"$1"/latest/"}' | xargs x-www-browser
}
CMD="$1"
debsrc "${CMD}"
And try something like debsrc /usr/share/doc/acpi/AUTHORS. Enjoy! Update: improved the one-liner thanks to josch's advice.
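For readers who find the pipeline hard to follow, here is a rough Python sketch of the same steps. All function names are hypothetical, and open_source_of assumes a Debian system where dpkg-query and apt-cache are available; the parsing helpers only rely on the "package: path" and "Package: name" line formats that the awk and grep-dctrl steps above already depend on.

```python
import os
import subprocess
import webbrowser


def owning_package(search_output: str) -> str:
    """Extract the binary package name from `dpkg-query --search` output."""
    first_line = search_output.splitlines()[0]
    packages = first_line.split(": ", 1)[0]  # "acpi" or "pkg1, pkg2"
    return packages.split(", ")[0]


def source_package(showsrc_output: str) -> str:
    """Extract the first `Package:` field from `apt-cache showsrc` output."""
    for line in showsrc_output.splitlines():
        if line.startswith("Package: "):
            return line[len("Package: "):].strip()
    raise ValueError("no Package field found")


def debsources_url(source: str) -> str:
    """Build the Debsources URL of the latest version of a source package."""
    return f"http://sources.debian.net/src/{source}/latest/"


def open_source_of(path: str) -> None:
    """Resolve `path` to its source package and open it in the default browser."""
    real = os.path.realpath(path)  # same role as `readlink -f`
    search = subprocess.run(["dpkg-query", "--search", real],
                            check=True, capture_output=True, text=True).stdout
    showsrc = subprocess.run(["apt-cache", "showsrc", owning_package(search)],
                             check=True, capture_output=True, text=True).stdout
    webbrowser.open(debsources_url(source_package(showsrc)))
```

Calling open_source_of("/usr/share/doc/acpi/AUTHORS") would then mirror the debsrc invocation above.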

6 May 2015

Matthieu Caneill: Debsources got swag and continuous integration

Debsources (http://sources.debian.net) is still under active development. We recently had a GNOME Outreachy intern, Jingjie Jiang, and we're about to work with 2 GSoC students, Clément Schreiner and Orestis Ioannou. I will present here the GitHub mirror we've set up, in order to allow external pull requests to be submitted, and to use the continuous integration service provided by Travis-CI.

GitHub and Travis-CI

Debsources' source code is hosted on Debian's git servers, and from there is mirrored to GitHub. Every time a commit is pushed (to master or other branches) or a pull request is opened, the test suite will be automatically run on Travis-CI, and the result (tests pass or don't) is displayed on GitHub. This allows us to quickly filter external contributions (when they are submitted on GitHub), and be sure everything works with our setup, before reviewing work.

Travis-CI runs the tests on OpenVZ containers. The complete infrastructure was a bit challenging to set up, but as we now have a Docker recipe to quickly begin to hack on Debsources, most of the work could be done using the Dockerfile instructions. On average, a run on Travis-CI (which includes git cloning the code and test data, setting up the server, and running the test suite) takes 7 minutes, which is an ok amount of time to wait for before submitting a pull request, in my opinion.

Bugs discovered in the process

Setting up this continuous integration infrastructure made me discover a few bugs.

Python magic does black magic

Debsources runs fine on Debian (not surprisingly), but I got tricked by black magic when I tried to run it on Ubuntu (which is the OS run in Travis-CI's containers). We use the magic library to guess the type of files we're dealing with, for instance when we need to decide between rendering a file (for text files) or downloading it (for binary files). Here comes the tricky part: the Python bindings for libmagic are not the same in Debian and PyPI.
Debsources uses the Debian package python-magic, which is not in Ubuntu 12.04. Moreover, there's no Python egg for it on PyPI, which has however another package (called magic) which provides a different API. I solved this with a dirty hack, using the fact that python-magic lies in a single file:
mkdir /tmp/python-magic && wget https://raw.githubusercontent.com/file/file/master/python/magic.py -O /tmp/python-magic/magic.py && export PYTHONPATH=/tmp/python-magic/:$PYTHONPATH
It simply downloads the library, saves it in a temporary folder and includes it in the Python path. Let's see how long it works before everything breaks!

Size of a directory

One test in the suite checked that the information returned by ls -l on a directory and stored in the DB was correct. Inode metadata was tested, such as name, permissions, type, or size. Interestingly enough, the size of a directory was tested, and expected to be 4096 bytes. The size of a directory actually depends on the filesystem in use, and on the number of files this directory contains. We often see 4096 because it's the size of a not-too-big directory on ext4. Travis-CI doesn't use ext4:
$ df -T
Filesystem            Type     1K-blocks      Used Available Use% Mounted on
/vz/private/209140041 simfs    125829120 103460612  22368508  83% /
none                  devtmpfs   1572864         8   1572856   1% /dev
none                  tmpfs       314576        56    314520   1% /run
none                  tmpfs         5120         4      5116   1% /run/lock
none                  tmpfs      1572864         0   1572864   0% /run/shm
/dev/null             tmpfs       786432    171584    614848  22% /var/ramfs
Simfs is a container filesystem for OpenVZ, on which directories have different sizes than on ext4:
$ ls -al /
total 0
drwxr-xr-x 23 root     root      480 Feb  4 18:08 .
drwxr-xr-x 23 root     root      480 Feb  4 18:08 ..
drwxr-xr-x  2 root     root     2480 Feb  4 18:20 bin
drwxr-xr-x  2 root     root       40 Apr 19  2012 boot
drwxr-xr-x  5 root     root      660 Apr 30 13:56 dev
drwxr-xr-x 99 root     root     3560 Apr 30 13:56 etc
-rw-r--r--  1 root     root        0 Feb  4 17:56 fastboot
drwxr-xr-x  3 root     root       80 Feb  4 17:57 home
[...]
Directory sizes are not even powers of 2. Hence I changed the test to not check directory sizes. Hopefully this will help to make Debsources work on more filesystems!

An empty file is hiding

Last but not least, because this bug is still open in the wild: a file which appears to be empty is not taken into account by Debsources' updater. This file is sources/non-free/m/make-doc-non-dfsg/4.0-2/.pc/applied-patches. It is present in the filesystem in the container and is not the only empty file over there, but it still doesn't appear in the database, which makes the test that counts files fail. The test has been commented out (booooooh), so that we can still use Travis-CI's platform for our GSoC students until it's fixed.

Conclusion

Making Debsources run automatically on a different platform than the one we usually use allowed us to spot bugs, write dirty hacks, and expand the range of filesystems it's supposed to run on. Now, let's hope the continuous integration will help our GSoC students, and let's wish them good luck!
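
For readers who want to see what the directory-size fix from the "Size of a directory" part could look like, here is a minimal sketch. It is not the actual Debsources test: the function name inode_metadata and the returned dictionary layout are hypothetical, chosen only for illustration. The idea is to collect the same kind of metadata as ls -l (name, type, permissions, size), but to deliberately leave out the size when the entry is a directory, since that value depends on the filesystem.

```python
import os
import stat
import tempfile


def inode_metadata(path):
    """Collect ls -l style metadata for `path` (hypothetical helper).

    The size of a directory depends on the filesystem (4096 on a small
    ext4 directory, other values on simfs), so it is reported as None
    and a test comparing two results will not trip over it.
    """
    st = os.lstat(path)  # lstat: do not follow symlinks
    perms = stat.filemode(st.st_mode)  # e.g. "drwxr-xr-x" or "-rw-r--r--"
    is_dir = stat.S_ISDIR(st.st_mode)
    return {
        "name": os.path.basename(path),
        "type": perms[0],  # "d" for directories, "-" for regular files
        "perms": perms[1:],
        "size": None if is_dir else st.st_size,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        subdir = os.path.join(root, "pkg")
        os.mkdir(subdir)
        empty = os.path.join(root, "applied-patches")
        open(empty, "w").close()  # an empty file still has a size: 0
        print(inode_metadata(subdir))
        print(inode_metadata(empty))
```

With this shape, regular files (including empty ones) are still checked byte for byte, while directories are only compared on name, type and permissions.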

6 June 2014

Stefano Zacchiroli: debsources paper at ESEM2014

Debsources: Live and Historical Views on Macro-Level Software Evolution

The paper entitled Debsources: Live and Historical Views on Macro-Level Software Evolution, which I've co-authored with Matthieu Caneill, has been accepted at ESEM 2014: the 8th International Symposium on Empirical Software Engineering and Measurement. In the paper we have described Debsources as a software platform for monitoring the evolution of Free Software through the lenses of Debian, and used the main Debsources instance (http://sources.debian.net) to replicate and extend a former study on macro-level software evolution. Now we "just" have to integrate all the nice charts and data we have extracted for the paper into Debsources' stats page... /o\

27 February 2014

Stefano Zacchiroli: moar stats for sources.debian.net

Debian: watch your stats!

Over the past few weeks, Matthieu Caneill and I have worked quite a bit on Debsources. As we have now deployed most of the new features on http://sources.debian.net, it's time for another "What's new with Debsources?" blog post. Here is what's new: Want more? Sure, we'll be happy to! But it'll happen faster if you help. Speaking of which: we've got Debsources into the new contributors game (see announcement) and we're looking forward to mentoring new contributors.

17 September 2013

Stefano Zacchiroli: sources.debian.net - advanced search and other news

all your ctag (and checksum) are belong to us

A few months after the initial announcement, here is some news about the sources.d.n service. I've been late in blogging this, but most of it was implemented by Matthieu Caneill and me during DebConf13, which has been a great DebConf, totally exceeding my expectations (and they were already fairly high!). First, you might have noticed some user-visible changes: On the other hand, under the hood: As usual, your bug reports (and patches!) are more than welcome, just check BUGS before reporting to avoid duplicates.
That's all!

2 July 2013

Bits from Debian: all Debian source are belong to us

This is a verbatim repost from Stefano Zacchiroli's post.

TL;DR: go to http://sources.debian.net and enjoy.
Debsources is a new toy I've been working on at IRILL together with Matthieu Caneill. In essence, debsources is a simple web application that lets you publish an unpacked Debian source mirror on the Web. You can deploy Debsources where you please, but there is a main instance at http://sources.debian.net (sources.d.n for short) that you will probably find interesting. sources.d.n follows closely the Debian archive in two ways:
  1. it is updated 4 times a day to reflect the content of the Debian archive
  2. it contains sources coming from official Debian suites: the usual ones (from oldstable to experimental), *-updates (ex volatile), *-proposed-updates, and *-backports (from Wheezy on)
Via sources.d.n you can therefore browse the content of Debian source packages with usual code viewing features like syntax highlighting. More interestingly, you can search through the source code (of unstable only, though) via integration with http://codesearch.debian.net. You can also use sources.d.n programmatically to query available versions or link to specific lines, with the possibility of adding contextual pop-up messages (example). In fact, you might have stumbled upon sources.d.n already in the past few days, via other popular Debian services where it has already been integrated. In particular: codesearch.d.n now defaults to show results via sources.d.n, and the PTS has grown new "browse source code" hyperlinks that point to it. If you've ideas of other Debian services where sources.d.n should be integrated, please let me know. I find Debsources and sources.d.n already quite useful but, as it often happens, there is still a lot TODO. Obviously, it is all Free Software (released under GNU AGPLv3). Do not hesitate to report new bugs and, better, to submit patches for the outstanding ones.

Acknowledgements

PS: in case you were wondering, at present sources.d.n requires ~381 GB of disk space to hold all uncompressed source packages, plus ~83 GB for the local (compressed) source mirror.

Stefano Zacchiroli: introducing sources.debian.net

all Debian source are belong to us

TL;DR: go to http://sources.debian.net and enjoy.
Debsources is a new toy I've been working on at IRILL together with Matthieu Caneill. In essence, debsources is a simple web application that lets you publish an unpacked Debian source mirror on the Web. You can deploy Debsources where you please, but there is a main instance at http://sources.debian.net (sources.d.n for short) that you will probably find interesting. sources.d.n follows closely the Debian archive in two ways:
  1. it is updated 4 times a day to reflect the content of the Debian archive
  2. it contains sources coming from official Debian suites: the usual ones (from oldstable to experimental), *-updates (ex volatile), *-proposed-updates, and *-backports (from Wheezy on)
Via sources.d.n you can therefore browse the content of Debian source packages with usual code viewing features like syntax highlighting. More interestingly, you can search through the source code (of unstable only, though) via integration with http://codesearch.debian.net. You can also use sources.d.n programmatically to query available versions or link to specific lines, with the possibility of adding contextual pop-up messages (example). In fact, you might have stumbled upon sources.d.n already in the past few days, via other popular Debian services where it has already been integrated. In particular: codesearch.d.n now defaults to show results via sources.d.n, and the PTS has grown new "browse source code" hyperlinks that point to it. If you've ideas of other Debian services where sources.d.n should be integrated, please let me know. I find Debsources and sources.d.n already quite useful but, as it often happens, there is still a lot TODO. Obviously, it is all Free Software (released under GNU AGPLv3). Do not hesitate to report new bugs and, better, to submit patches for the outstanding ones.

Acknowledgements

PS: in case you were wondering, at present sources.d.n requires ~381 GB of disk space to hold all uncompressed source packages, plus ~83 GB for the local (compressed) source mirror.