Russ Allbery: Review: Driving the Deep
Series: | Finder Chronicles #2 |
Publisher: | DAW |
Copyright: | 2020 |
ISBN: | 0-7564-1512-8 |
Format: | Kindle |
Pages: | 426 |
The Reproducible Builds community sadly announces it has lost its founding member.
Jérémy Bobbio aka Lunar passed away on Friday November 8th in palliative care in Rennes, France.
Lunar was instrumental in starting the Reproducible Builds project in 2013 as a loose initiative within the Debian project. Many of our earliest status reports were written by him and many of our key tools in use today are based on his design.
Lunar was a resolute opponent of surveillance and censorship, and he possessed an unwavering energy that fueled his work on Reproducible Builds and Tor. Without Lunar's far-sightedness, drive and commitment to enabling teams around him, Reproducible Builds and free software security would not be in the position it is in today. His contributions will not be forgotten, and his high standards and drive will continue to serve as an inspiration to us as well as for the other high-impact projects he was involved in.
Lunar's creativity, insight and kindness were often noted. He will be greatly missed.
Other tributes:
Although it is possible to increase confidence in Free and Open Source Software (FOSS) by reviewing its source code, trusting code is not the same as trusting its executable counterparts. These are typically built and distributed by third-party vendors with severe security consequences if their supply chains are compromised. In this paper, we present reproducible builds, an approach that can determine whether generated binaries correspond with their original source code. We first define the problem and then provide insight into the challenges of making real-world software build in a "reproducible" manner, that is, when every build generates bit-for-bit identical results. Through the experience of the Reproducible Builds project making the Debian Linux distribution reproducible, we also describe the affinity between reproducibility and quality assurance (QA).
According to Google Scholar, the paper has accumulated almost 40 citations since publication. The full text of the paper can be found in PDF format.
The folks from the Reproducibility Project have come a long way since they started working on it 10 years ago, and we believe it's time for the next step in Debian. Several weeks ago, we enabled a migration policy in our migration software that checks for regression in reproducibility. At this moment, that is presented as just for info, but we intend to change that to delays in the not so distant future. We eventually want all packages to be reproducible. To stimulate maintainers to make their packages reproducible now, we'll soon start to apply a bounty [speedup] for reproducible builds, like we've done with passing autopkgtests for years. We'll reduce the bounty for successful autopkgtests at that moment in time.
"What we have done," explains Sollins, "is to develop, prove correct, and demonstrate the viability of an approach that allows the [software] maintainers to remain anonymous." Preserving anonymity is obviously important, given that almost everyone, software developers included, values their confidentiality. This new approach, Sollins adds, "simultaneously allows [software] users to have confidence that the maintainers are, in fact, legitimate maintainers and, furthermore, that the code being downloaded is, in fact, the correct code of that maintainer." [...]
The corresponding paper is published on the arXiv preprint server in various formats, and the announcement has also been covered in MIT News.
I noticed that a small but fixed subset of [Git] repositories are getting backed up despite having no changes made. That is odd because I would think that repeated bundling of the same repository state should create the exact same bundle. However [it] turns out that for some repositories, bundling is nondeterministic.
Paul goes on to describe his solution, which involves forcing git to be single threaded, which "makes the output deterministic". The article was also discussed on Hacker News.
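As a rough sketch of the single-threaded approach, the snippet below builds a `git bundle create` command with Git's `pack.threads` configuration setting forced to 1. Whether this matches the article's exact invocation is an assumption, and the function names and paths are hypothetical.

```python
import subprocess


def bundle_command(bundle_path, threads=1):
    """Return the argv for bundling all refs with a fixed pack thread count."""
    return [
        "git",
        "-c", f"pack.threads={threads}",  # assumption: single-threaded packing
        "bundle", "create", str(bundle_path), "--all",
    ]


def create_bundle(repo_dir, bundle_path):
    """Run the bundle command inside repo_dir (hypothetical helper)."""
    subprocess.run(bundle_command(bundle_path), cwd=repo_dir, check=True)
```

Comparing a checksum of two bundles created from the same repository state is a simple way to check whether the output has actually become stable.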
libxslt now deterministic
The generate-id() XSLT function is now deterministic across multiple transformations, fixing many issues with reproducible builds. As the Git commit by Nick Wellnhofer describes:
Rework the generate-id() function to return deterministic values. We use
a simple incrementing counter and store ids in the 'psvi' member of
nodes which was freed up by previous commits. The presence of an id is
indicated by a new "source node" flag.
This fixes long-standing problems with reproducible builds, see
https://bugzilla.gnome.org/show_bug.cgi?id=751621
This also hardens security, as the old implementation leaked the
difference between a heap and a global pointer, see
https://bugs.chromium.org/p/chromium/issues/detail?id=1356211
The old implementation could also generate the same id for dynamically
created nodes which happened to reuse the same memory. Ids for namespace
nodes were completely broken. They now use the id of the parent element
together with the hex-encoded namespace prefix.
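The idea in the commit message can be illustrated with a small sketch. This is not libxslt's code: the `Node` and `IdGenerator` classes and the `id<N>` string format are hypothetical, and only the technique (an incrementing counter whose value is stored on the node, instead of an id derived from a memory address) follows the description above.

```python
import itertools


class Node:
    """Hypothetical document node with a spare slot for a generated id."""

    def __init__(self, name):
        self.name = name
        self.psvi = None  # holds the id once one has been assigned


class IdGenerator:
    """Assigns ids from an incrementing counter, one generator per transformation."""

    def __init__(self):
        self._counter = itertools.count(1)

    def generate_id(self, node):
        # Assign an id on first use, then keep returning the stored value.
        if node.psvi is None:
            node.psvi = next(self._counter)
        return f"id{node.psvi}"


def run_transformation():
    gen = IdGenerator()
    nodes = [Node(n) for n in ("a", "b", "c")]
    return [gen.generate_id(n) for n in nodes]
```

Because the ids depend only on the order in which nodes are visited, two runs over the same input yield the same strings, whereas ids derived from pointers vary with memory layout.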
generate-draft script to not blow up if the input files have been corrupted today or even in the past [...], Holger Levsen updated the Hamburg 2023 summit to add a link to the farewell post [...] & to add a picture of a Post-It note [...], and Pol Dellaiera updated the paragraph about tar and the --clamp-mtime flag [...].
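To illustrate what clamping modification times means, here is a minimal sketch using Python's standard `tarfile` module rather than GNU tar itself. The helper names and the in-memory file layout are assumptions made for the example; the rule it applies (files newer than the clamp value get the clamp value, older files keep their own timestamp) mirrors the behaviour of tar's --clamp-mtime.

```python
import io
import tarfile


def clamp_mtime(tarinfo, clamp):
    """Lower the member's mtime to `clamp` if it is newer; keep it otherwise."""
    tarinfo.mtime = min(tarinfo.mtime, clamp)
    return tarinfo


def build_tar(files, clamp):
    """Build a tar archive in memory from {name: (data, mtime)} (hypothetical layout)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for name in sorted(files):  # fixed member order
            data, mtime = files[name]
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(clamp_mtime(info, clamp), io.BytesIO(data))
    return buf.getvalue()
```

Two builds whose inputs differ only in timestamps newer than the clamp value then produce byte-identical archives.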
On our mailing list this month, Bernhard M. Wiedemann posted an interesting summary on some of the reasons why packages are still not reproducible in 2023.
objdump symbol comment filter inputs as Python byte (and not str) instances [...] and Vagrant Cascadian extended diffoscope support for GNU Guix [...] and updated the version in that distribution to version 253 [...].
deep-dive into 6 tools and the accuracy of the SBOMs they produce for complex open-source Java projects. Our novel insights reveal some hard challenges regarding the accurate production and usage of software bills of materials.
The paper is available on arXiv.
crack [...] (#1021521 & #1021522)
dustmite [...] (#1020878 & #1020879)
edid-decode [...] (#1020877)
gentoo [...] (#1024284)
haskell98-report [...] (#1024007)
infinipath-psm [...] (#990862)
lcm [...] (#1024286)
libapache-mod-evasive [...] (#1020800)
libccrtp [...] (#860470)
libinput [...] (#995809)
lirc [...] (#979019, #979023 & #979024)
mm-common [...] (#977177)
mpl-sphinx-theme [...] (#1005826)
psi [...] (#1017473)
python-parse-type [...] (#1002671)
ruby-tioga [...] (#1005727)
ucspi-proxy [...] (#1024125)
ypserv [...] (#983138)
.buildinfo files in Debian trixie, specifically lorene (0.0.0~cvs20161116+dfsg-1.1), maria (1.3.5-4.2) and ruby-rinku (1.7.3-2.1).
create-meta-pkgs tool. [...][...]
python3-setuptools and swig packages, which are now needed to build OpenWrt. [...]
pkg-config needed to build Coreboot artifacts. [...]
fakeroot tool is implicitly required but not automatically installed. [...]
vmlinuz file. [...]
freebsd-jenkins.debian.net has been updated to FreeBSD 14.0. [...]
apr (hostname issue)
dune (parallelism)
epy (time-based .pyc issue)
fpc (Year 2038)
gap (date)
gh (FTBFS in 2024)
kubernetes (fixed random build path)
libgda (date)
libguestfs (tar)
metamail (date)
mpi-selector (date)
neovim (randomness in Lua)
nml (time-based .pyc)
pommed (parallelism)
procmail (benchmarking)
pysnmp (FTBFS in 2038)
python-efl (drop Sphinx doctrees)
python-pyface (time)
python-pytest-salt-factories (time-based .pyc issue)
python-quimb (fails to build on single-CPU systems)
python-rdflib (random)
python-yarl (random path)
qt6-webengine (parallelism issue in documentation)
texlive (Gzip modification time issue)
waf (time-based .pyc)
warewulf (CPIO modification time and inode issue)
xemacs (toolchain hostname)
python-aiostream.
openpyxl.
python-multipletau.
wxmplot.
stunnel4.
qttools-opensource-src.
#reproducible-builds on irc.oftc.net.
rb-general@lists.reproducible-builds.org
--remap-path-prefix solves this problem and has been used to great effect in build systems that rely on reproducibility (Bazel, Nix) to work at all and that there are efforts to teach cargo about it here.
The ctx hosted project on PyPI was taken over via user account compromise and replaced with a malicious project which contained runtime code which collected the content of os.environ.items() when instantiating Ctx objects. The captured environment variables were sent as a base64 encoded query parameter to a Heroku application [...]
As their announcement later goes on to state, version-pinning using hash-checking mode can prevent this attack, although this does depend on specific installations using this mode, rather than a prevention that can be applied systematically.
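The principle behind hash-checking can be sketched in a few lines: an artifact is accepted only if its digest matches a value pinned in advance, so a replaced upload fails the check even when the name and version are unchanged. This is an illustration of the idea, not pip's implementation, and `verify_artifact` is a hypothetical helper.

```python
import hashlib
import hmac


def verify_artifact(data: bytes, pinned_sha256: str) -> bool:
    """Return True only if the SHA-256 of data equals the pinned hex digest."""
    digest = hashlib.sha256(data).hexdigest()
    # compare_digest compares in constant time; the pinned value is lowercased first
    return hmac.compare_digest(digest, pinned_sha256.lower())
```

In pip itself, the pinned digests live next to the pinned versions in the requirements file, and installation refuses any file whose hash does not match.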
.jar may have been unnecessary given that diffoscope would have identified the issue, it must be said that there is something to be said for occasionally delving into seemingly low-level details, as well as describing any debugging process. Indeed, as vanitasvitae writes:
Yes, this would have spared me from 3h of debugging. But I probably would also not have gone onto this little dive into the JAR/ZIP format, so in the end I'm not mad.
KBUILD_BUILD_TIMESTAMP) in order to prepare my build with the "known to disrupt code layout" options disabled.
nondeterministic_checksum_generated_by_coq and nondetermistic_js_output_from_webpack.
After Holger Levsen found hundreds of packages in the bookworm distribution that lack .buildinfo files, he uploaded 404 source packages to the archive (with no meaningful source changes). bookworm now shows only 8 packages without .buildinfo files, and those 8 are fixed in unstable and should migrate shortly. By contrast, Debian unstable will always have packages without .buildinfo files, as this is how they come through the NEW queue. However, as these packages were not built on the official build servers (ie. they were uploaded by the maintainer) they will never migrate to Debian testing. In the future, therefore, testing should never have packages without .buildinfo files again.
Roland Clobus posted yet another in-depth status report about his progress making the Debian Live images build reproducibly to our mailing list. In this update, Roland mentions that all major desktops build reproducibly with bullseye, bookworm and sid but also goes on to outline the progress made with automated testing of the generated images using openQA.
FORCE_SOURCE_DATE=1 in the environment of all builds in order to fix numerous timestamp issues in documentation generation tools.
maradns package as it appears to embed a random prime number. (Patch)
This paper focuses on one research question: how can Guix (https://www.gnu.org/software/guix/) and similar systems allow users to securely update their software? [...] Our main contribution is a model and tool to authenticate new Git revisions. We further show how, building on Git semantics, we build protections against downgrade attacks and related threats. We explain implementation choices. This work has been deployed in production two years ago, giving us insight on its actual use at scale every day. The Git checkout authentication at its core is applicable beyond the specific use case of Guix, and we think it could benefit to developer teams that use Git.
A full PDF of the text is available.
215, 216 and 217 to Debian unstable. Chris Lamb also made the following changes:
--profile and we were killed via a TERM signal. This should help in situations where diffoscope is terminated due to some sort of timeout. [...]
IndexError exceptions (in addition to ValueError) when parsing .pyc files. (#1012258)
argcomplete module. [...]
readelf (ie. binutils), as it appears that this patch level version change resulted in a change of output, not the minor version. [...]
@skip_unless_tool_is_at_least decorator (NB. at_least) over @skip_if_tool_version_is (NB. is) to fix tests under Debian stable. [...]
TERM signal. [...]
build-compare caused a regression for a few days.
python-fasttext (CPU-related issue).
node-dommatrix.
rtpengine.
sphinxcontrib-mermaid.
yaru-theme.
mapproxy (forwarded upstream).
libxsmm.
yt-dlp (forwarded upstream).
lz4, lzop and xz-utils packages on all nodes in order to detect running kernels. [...]
SOURCE_DATE_EPOCH environment variable [...]. In addition, Sebastian Crane very helpfully updated the screenshot of salsa.debian.org's "request access" button on the "How to join the Salsa group" page. [...]
Various efforts towards build verifiability have been made to C/C++-based systems, yet the techniques for Java-based systems are not systematic and are often specific to a particular build tool (eg. Maven). In this study, we present a systematic approach towards build verifiability on Java-based systems.
We first define the problem, and then provide insight into the challenges of making real-world software build in a reproducible manner, that is, when every build generates bit-for-bit identical results. Through the experience of the Reproducible Builds project making the Debian Linux distribution reproducible, we also describe the affinity between reproducibility and quality assurance (QA).
SOURCE_DATE_EPOCH specification related to formats that cannot help embedding potentially timezone-specific timestamps. (Full thread index.)
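A common way for a build tool to honour SOURCE_DATE_EPOCH while avoiding timezone-dependent output is to read the variable as seconds since the Unix epoch and always format it in UTC. The sketch below shows that pattern; the function names are hypothetical, and falling back to the current time when the variable is unset is one possible choice among several.

```python
import os
import time
from datetime import datetime, timezone


def build_timestamp(environ=os.environ):
    """Return the build time as an aware UTC datetime, preferring SOURCE_DATE_EPOCH."""
    epoch = int(environ.get("SOURCE_DATE_EPOCH", time.time()))
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


def build_date_string(environ=os.environ):
    """Format the build time without consulting the local timezone."""
    return build_timestamp(environ).strftime("%Y-%m-%d %H:%M:%S UTC")
```

For example, `build_date_string({"SOURCE_DATE_EPOCH": "0"})` returns `1970-01-01 00:00:00 UTC` regardless of the TZ setting of the build machine.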
203, 204, 205 and 206 to Debian unstable, as well as made the following changes to the code itself:
file(1)-related regression where Debian .changes files that contained non-ASCII text were not identified as such, therefore resulting in seemingly arbitrary packages not actually comparing the nested files themselves. The non-ASCII parts were typically in the Maintainer or in the changelog text. [...][...]
binwalk, return False from BinwalkFile.recognizes. [...]
binwalk, don't report that we are missing the Python rpm module! [...]
diffoscope and diffoscope-minimal packages have the same version. [...]
debian-devel mailing list after noticing that the binutils source package contained unreproducible logs in one of its binary packages. Vagrant expanded the discussion to one about all kinds of build metadata in packages and outlines a number of potential solutions that support reproducible builds and arbitrary metadata.
Vagrant also started a discussion on debian-devel after identifying a large number of packages that embed build paths via RPATH when building with CMake, including a list of packages (grouped by Debian maintainer) affected by this issue. Maintainers were requested to check whether their package still builds correctly when passing the -DCMAKE_BUILD_RPATH_USE_ORIGIN=ON directive.
On our mailing list this month, kpcyrd announced the release of rebuilderd-debian-buildinfo-crawler, a tool that parses the Packages.xz Debian package index file, attempts to discover the right .buildinfo file from buildinfos.debian.net and outputs it in a format that can be understood by rebuilderd. The tool, which is available on GitHub, solves a problem regarding correlating Debian version numbers with their builds.
bauen1 provided two patches for debian-cd, the software used to make Debian installer images. This involved passing --invariant and -i deb00001 to mkfs.msdos(8) and avoided embedding timestamps into the gzipped Packages and Translations files. After some discussion, the patches in question were merged and will be included in debian-cd version 3.1.36.
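The gzip format stores a modification time in its header, which is why compressing identical data at different moments can yield different files. The following sketch shows the general remedy with Python's standard `gzip` module by fixing that field to zero and omitting the file name; it illustrates the problem class rather than the actual debian-cd patch, and `gzip_reproducibly` is a hypothetical helper.

```python
import gzip
import io


def gzip_reproducibly(data: bytes) -> bytes:
    """Compress data with no embedded file name and a zeroed header timestamp."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename="", fileobj=buf, mode="wb", mtime=0) as gz:
        gz.write(data)
    return buf.getvalue()
```

With the timestamp pinned, repeated compression of the same input gives byte-identical output, which is what an index file such as Packages needs in order to be reproducible.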
Roland Clobus wrote another in-depth status update about the status of live Debian images, summarising the current situation that all major desktops build reproducibly with bullseye, bookworm and sid.
python3.10 package was uploaded to Debian by doko, fixing an issue where .pyc files were not reproducible because the elements in frozenset data structures were not ordered reproducibly. This meant that creating a bit-for-bit reproducible Debian chroot which included .pyc files was not possible. As of writing, the only remaining unreproducible part of a standard chroot is man-db, but Guillem Jover has a patch for update-alternatives which will likely be part of the next release of dpkg.
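The underlying problem is general: the iteration order of a set of strings can differ between Python processes, so any output that simply walks a set may differ from build to build. The sketch below shows the usual remedy of sorting before serialising. It is an illustration only, not the change made in the python3.10 package, and `serialise_set` is a hypothetical helper.

```python
def serialise_set(items):
    """Serialise a set of strings in sorted order so the bytes do not depend on hashing."""
    return ",".join(sorted(items)).encode("utf-8")
```

Serialising `frozenset({"b", "a", "c"})` this way always gives `b"a,b,c"`, whichever order the interpreter happens to iterate in.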
Elsewhere in Debian, 139 reviews of Debian packages were added, 29 were updated and 17 were removed this month adding to our knowledge about identified issues. A large number of issue types have been updated too, including the addition of captures_kernel_variant, erlang_escript_file, captures_build_path_in_r_rdb_rds_databases, captures_build_path_in_vo_files_generated_by_coq and build_path_in_vo_files_generated_by_coq.
contributors.sh Bash/shell script into a Python script. [...][...][...]
btop (sort-related issue)
complexity (date)
giac (update the version with upstreamed date patch)
htcondor (use CMake timestamp)
libint (readdir system call related)
libnet (date-related issue)
librime-lua (sort filesystem ordering)
linux_logo (sort-related issue)
micro-editor (date-related issue)
openvas-smb (date-related issue)
ovmf (sort-related issue)
paperjam (date-related issue)
python-PyQRCode (date-related issue)
quimb (single-CPU build failure)
radare2 (Meson date/time-related issue)
radare2 (Rework SOURCE_DATE_EPOCH usage to be portable)
siproxd (date, with Sebastian Kemper + follow-up)
xonsh (Address Space Layout Randomisation-related issue)
xsnow (date & tar(1)-related issue)
zip (toolchain issue related to filesystem ordering)
ltsp (forwarded upstream).
pcmemtest.
hatchling.
mpl-sphinx-theme (forwarded upstream)
gap-hapcryst.
tree-puzzle.
jcabi-aspects.
paper-icon-theme.
wcwidth.
xir.
xir.
ruby-github-markup.
ruby-tioga.
btop.
libadwaita-1.
snibbetracker.
cctbx.
mdnsd.
gmerlin.
beav.
krita.
qt6-base.
onevpl-intel-gpu.
ruby3.0.
nix.
foma.
ruby3.0.
openwrt.git repository the next day.
useradd warnings when building packages. [...]
armhf architecture nodes to add a hint to where nodes named virt-*. [...]
logrotate and man-db services. [...]
Although it is possible to increase confidence in Free and Open Source Software (FOSS) by reviewing its source code, trusting code is not the same as trusting its executable counterparts. These are typically built and distributed by third-party vendors with severe security consequences if their supply chains are compromised. In this paper, we present reproducible builds, an approach that can determine whether generated binaries correspond with their original source code. We first define the problem and then provide insight into the challenges of making real-world software build in a "reproducible" manner, that is, when every build generates bit-for-bit identical results. Through the experience of the Reproducible Builds project making the Debian Linux distribution reproducible, we also describe the affinity between reproducibility and quality assurance (QA).
The full text of the paper can be found in PDF format and should appear, with an alternative layout, within a forthcoming issue of the physical IEEE Software magazine.
The Debian Janitor is an automated system that commits fixes for (minor) issues in Debian packages that can be fixed by software. It gradually started proposing merges in early December. The first set of changes sent out ran lintian-brush on sid packages maintained in Git. This post is part of a series about the progress of the Janitor. Linux distributions like Debian fulfill an important function in the FOSS ecosystem - they are system integrators that take existing free and open source software projects and adapt them where necessary to work well together. They also make it possible for users to install more software in an easy and consistent way and with some degree of quality control and review. One of the consequences of this model is that the distribution package often lags behind upstream releases. This is especially true for distributions that have tighter integration and standardization (such as Debian), and often new upstream code is only imported irregularly because it is a manual process - both updating the package, but also making sure that it still works together well with the rest of the system. The process of importing a new upstream used to be (well, back when I started working on Debian packages) fairly manual and something like this:
version=4
http://somesite.com/dir/filenamewithversion.tar.gz

---
Repository: https://www.dulwich.io/code/dulwich/
Repository-Browse: https://www.dulwich.io/code/dulwich/

echo deb "[arch=amd64 signed-by=/usr/share/keyrings/debian-janitor-archive-keyring.gpg]" \
    https://janitor.debian.net/ fresh-snapshots main | sudo tee /etc/apt/sources.list.d/fresh-snapshots.list
echo deb "[arch=amd64 signed-by=/usr/share/keyrings/debian-janitor-archive-keyring.gpg]" \
    https://janitor.debian.net/ fresh-releases main | sudo tee /etc/apt/sources.list.d/fresh-releases.list
sudo curl -o /usr/share/keyrings/debian-janitor-archive-keyring.gpg https://janitor.debian.net/pgp_keys
apt update

apt install -t fresh-snapshots r-cran-roxygen2

[1] I'm not saying that a monoculture is great here, but it does help distributions.
README (#25).
circlator, dvbstreamer, eric, jbbp, knot-resolver, libjs-qunit, mail-expire, osmo-mgw, python-pyramid, pyvows & sayonara.
debian/copyright file to match the copyright notices in the source tree. (#224)
.py copyright headers. [...]
readelf(1). [...]
minimal instead of basic as a variable name to match the underlying package name. [...]
pprint.pformat in the JSON comparator to serialise the differences from jsondiff. [...]
python-django:
2.2.17-2 Fix compatibility with GNU gettext version 0.21. (#978263)
3.1.4-1 New upstream bugfix release.
redis:
mtools (4.0.26-1) New upstream release.
adminer (4.7.8-2) on behalf of Alexandre Rossi and performed two QA uploads of sendfile (2.1b.20080616-7 and 2.1b.20080616-8) to make the build reproducible (#776938) and to fix a number of other unrelated issues.
Debian LTS
This month I have worked 18 hours on Debian Long Term Support (LTS) and 12 hours on its sister Extended LTS project.
awstats, imagemagick, node-ini, openexr, openssl1.0, p11-kit, pypy, python-py, sqlite3, sympa, etc.
node-ini, an .ini configuration file format parser/serialiser for Node.js, where an application could be exploited by a malicious input file.
build_path_captured_by_pyuic5, build_path_captured_by_octave & build_path_captured_by_nim.
SOURCE_DATE_EPOCH is not Debian specific [...] and make a number of misc cosmetic changes [...][...].
--load-existing-diff command. [...]
diffoscope-minimal package that was introduced by Mattia Rizzolo has a different short description from the primary diffoscope one. [...]
(2.2.17-1 & 3.1.3-1) New upstream releases.
(1.6.9+dfsg-1) New upstream release.
(2.101.0, 2.102.0, 2.103.0 & 2.104.0) New upstream releases.
(2.14) Mark an autopkgtest as 'superficial'. (#974491)
(2.1-1) New upstream release.
(3.1.2+dfsg-3) Re-upload a previous QA upload of mine (3.1.2+dfsg-2) to ensure the package's transition to the testing distribution. (#974872)
minidlna package which could not be successfully purged from the system without reporting a "cannot remove '/var/log/minidlna'" error. (#975372)
codemirror-js, glibc, jupyter-notebook, krb5, libhibernate3-java, raptor2, spice-vdagent & webcit.
(CVE-2020-26939)
(gdm3) where gdm3 not detecting any users may have caused gdm3 to launch the initial system setup, permitting the creation of new users with superuser capabilities. (CVE-2020-16125)
sddm display manager. Here, local and unprivileged users could create a connection to the X server. (CVE-2020-28049)
(CVE-2020-28196)
raptor2, a set of parsers for Resource Description Framework (RDF) files used in LibreOffice and other applications. (CVE-2017-18926)
(CVE-2020-28948 & CVE-2020-28949)
The Debian Janitor is an automated
system that commits fixes for (minor) issues in Debian packages that can be
fixed by software. It gradually started proposing merges in early
December. The first set of changes sent out ran lintian-brush on sid packages maintained in
Git. This post is part of a series about the progress of the
Janitor.
The Janitor knows how to talk to different hosting platforms.
For each hosting platform, it needs to support the platform-
specific API for creating and managing merge proposals.
For each hoster it also needs to have credentials.
At the moment, it supports the GitHub API,
Launchpad API and GitLab API. Both GitHub and Launchpad have only a
single instance; the GitLab instances it supports are gitlab.com and salsa.debian.org.
This provides coverage for the vast majority of Debian packages
that can be accessed using Git. More than 75% of all packages
are available on salsa - although in some cases, the Vcs-Git
header has not yet been updated.
Of the other 25%, the majority either do not declare where
they are hosted using a Vcs-* header (10.5%), or have not
yet migrated from alioth to another hosting platform
(9.7%). A further 2.3% are hosted somewhere on
GitHub (2%),
Launchpad (0.18%) or
GitLab.com (0.15%), in many cases
in the same repository as the upstream code.
The remaining 1.6% are hosted on many other hosts, primarily
people's personal servers (which usually don't have an
API for creating pull requests).
Hoster | Open | Merged & Applied | Closed |
github.com | 92 | 168 | 5 |
gitlab.com | 12 | 3 | 0 |
code.launchpad.net | 24 | 51 | 1 |
salsa.debian.org | 1,360 | 5,657 | 126 |
For more information about the Janitor's lintian-fixes efforts, see the landing page.
From that, Bacon elaborates on possible reasons for the apparent decline of the GPL. The graphic used in the article was actually generated by Stephen O'Grady in a January article, The State Of Open Source Licensing, which said:
In Black Duck's sample, the most popular variant of the GPL version 2 is less than half as popular as it was (46% to 19%). Over the same span, the permissive MIT has gone from 8% share to 29%, while its permissive cousin the Apache License 2.0 jumped from 5% to 15%.
Sullivan, however, argued that the methodology used to create both articles was problematic. Neither contains original research: the graphs actually come from the Black Duck Software "KnowledgeBase" data, which was partly created from the old Ohloh web site now known as Open Hub. To show one problem with the data, Sullivan mentioned two free-software projects, GNU Bash and GNU Emacs, that had been showcased on the front page of Ohloh.net in 2012. On the site, Bash was (and still is) listed as GPLv2+, whereas it changed to GPLv3 in 2011. He also claimed that "Emacs was listed as licensed under GPLv3-only, which is a license Emacs has never had in its history", although I wasn't able to verify that information from the Internet archive. Basically, according to Sullivan, "the two projects featured on the front page of a site that was using [the Black Duck] data set were wrong". This, in turn, seriously brings into question the quality of the data:
I reported this problem and we'll continue to do that but when someone is not sharing the data set that they're using for other people to evaluate it and we see glimpses of it which are incorrect, that should give us a lot of hesitation about accepting any conclusion that comes out of it.
Reproducible observations are necessary to the establishment of solid theories in science. Sullivan didn't try to contact Black Duck to get access to the database, because he assumed (rightly, as it turned out) that he would need to "pay for the data under terms that forbid you to share that information with anybody else". So I wrote Black Duck myself to confirm this information. In an email interview, Patrick Carey from Black Duck confirmed its data set is proprietary. He believes, however, that through a "combination of human and automated techniques", Black Duck is "highly confident at the accuracy and completeness of the data in the KnowledgeBase". He did point out, however, that "the way we track the data may not necessarily be optimal for answering the question on license use trend" as "that would entail examination of new open source projects coming into existence each year and the licenses used by them". In other words, even according to Black Duck, its database may not be useful to establish the conclusions drawn by those articles. Carey did agree with those conclusions intuitively, however, saying that "there seems to be a shift toward Apache and MIT licenses in new projects, though I don't have data to back that up". He suggested that "an effective way to answer the trend question would be to analyze the new projects on GitHub over the last 5-10 years." Carey also suggested that "GitHub has become so dominant over the recent years that just looking at projects on GitHub would give you a reasonable sampling from which to draw conclusions".
Indeed, GitHub published a report in 2015 that also seems to confirm MIT's popularity (45%), surpassing copyleft licenses (24%). The data is, however, not without its own limitations. For example, in the above graph going back to the inception of GitHub in 2008, we see a rather abnormal spike in 2013, which seems to correlate with the launch of the choosealicense.com site, described by GitHub as "our first pass at making open source licensing on GitHub easier". In his talk, Sullivan was critical of the initial version of the site which he described as biased toward permissive licenses. Because the GitHub project creation page links to the site, Sullivan explained that the site's bias could have actually influenced GitHub users' license choices. Following a talk from Sullivan at FOSDEM 2016, GitHub addressed the problem later that year by rewording parts of the front page to be more accurate, but any resulting change in license choice obviously doesn't show in the report produced in 2015 and won't affect choices users have already made. Therefore, there can be reasonable doubts that GitHub's subset of software projects is actually that representative of the larger free-software community.
The long history of Debian creates a perfect subject to evaluate how FOSS licenses use has evolved over time, and the popularity of licenses currently in use.
Sullivan argued that the Debsources data set is interesting because of its quality: every package in Debian has been reviewed by multiple humans, including the original packager, but also by the FTP masters to ensure that the distribution can legally redistribute the software. The existence of a package in Debian provides a minimal "proof of use": unmaintained packages get removed from Debian on a regular basis and the mere fact that a piece of software gets packaged in Debian means at least some users found it important enough to work on packaging it. Debian packagers make specific efforts to avoid code duplication between packages in order to ease security maintenance. The data set covers a period longer than Black Duck's or GitHub's, as it goes all the way back to the Hamm 2.0 release in 1998. The data and how to reproduce it are freely available under a CC BY-SA 4.0 license.
Sullivan presented the above graph from the research paper that showed the evolution of software license use in the Debian archive. Whereas previous graphs showed statistics in percentages, this one showed actual absolute numbers, where we can't actually distinguish a decline in copyleft licenses. To quote the paper again:
The top license is, once again, GPL-2.0+, followed by: Artistic-1.0/GPL dual-licensing (the licensing choice of Perl and most Perl libraries), GPL-3.0+, and Apache-2.0.
Indeed, looking at the graph, at most we see a rise of the Apache and MIT licenses and no decline of the GPL per se, although its adoption does seem to slow down in recent years. We should also mention the possibility that Debian's data set has the opposite bias: toward GPL software. The Debian project is culturally quite different from the GitHub community and even the larger free-software ecosystem, naturally, which could explain the disparity in the results. We can only hope a similar analysis can be performed on the much larger Software Heritage data set eventually, which may give more representative results. The paper acknowledges this problem:
Debian is likely representative of enterprise use of FOSS as a base operating system, where stable, long-term and seldomly updated software products are desirable. Conversely Debian is unlikely representative of more dynamic FOSS environments (e.g., modern Web-development with micro libraries) where users, who are usually developers themselves, expect to receive library updates on a daily basis.
The Debsources research also shares methodology limitations with Black Duck: while Debian packages are reviewed before uploading and we can rely on the copyright information provided by Debian maintainers, the research also relies on automated tools (specifically FOSSology) to retrieve license information. Sullivan also warned against "ascribing reason to numbers": people may have different reasons for choosing a particular license. Developers may choose the MIT license because it has fewer words, for compatibility reasons, or simply because "their lawyers told them to". It may not imply an actual deliberate philosophical or ideological choice. Finally, he brought up the theory that the rise of non-copyleft licenses isn't necessarily to the detriment of the GPL. He explained that, even if there is an actual decline, it may not be much of a problem if there is an overall growth of free software to the detriment of proprietary software. He reminded the audience that non-copyleft licenses are still free software, according to the FSF and the Debian Free Software Guidelines, so their rise is still a positive outcome. Even if the GPL is a better tool to accomplish the goal of a free-software world, we can all acknowledge that the conversion of proprietary software to more permissive and certainly simpler licenses is definitely heading in the right direction.
[I would like to thank the DebConf organizers for providing meals for me during the conference.] Note: this article first appeared in the Linux Weekly News.