Antoine Beaupré: Django debates privacy concern
In recent years, privacy issues have become a growing concern among
free-software projects and users. As more and more software tasks become
web-based, surveillance and tracking of users is also on the rise. While some
software may use advertising as a source of revenue, which has the side
effect of monitoring users, the Django
community recently got into an interesting debate surrounding a proposal
to add user tracking (actually, developer tracking) to the popular
Python web framework.
Tracking for funding
A novel aspect of this debate is that the initiative comes from
concerns of the Django Software Foundation (DSF) about funding. The
proposal suggests that "relying on the free labor of volunteers is
ineffective, unfair, and risky" and states that "the future of Django
depends on our ability to fund its development". In fact, the DSF
recently hired an engineer to help oversee Django's development, which
has been quite successful in helping the project make timely releases
with fewer bugs. Various fundraising efforts have resulted in major
new Django features, but it is difficult to attract sponsors without
some hard data on the usage of Django.
The proposed feature tries to count the number of "unique developers"
and gather some metrics of their environments by using Google
Analytics (GA) in Django. The actual proposal (DEP 8) is done as a
pull request, which is part of the Django Enhancement Proposal (DEP)
process that is similar in
spirit to the Python Enhancement Proposal (PEP) process. DEP 8 was
brought forward by a longtime Django developer, Jacob Kaplan-Moss.
The rationale is that "if we had clear data on the extent of Django's
usage, it would be much easier to approach organizations for funding".
The proposal is essentially about adding code in Django to send a
certain set of metrics when "developer" commands are run. The system
would be "opt-out", enabled by default unless turned off, although the
developer would be warned the first time the phone-home system is
used. The proposal notes that an opt-in system "severely undercounts"
and is therefore not considered "substantially better than a community
survey" that the DSF is already doing.
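To make the opt-out model concrete, here is a minimal sketch of the decision described above: reporting is on unless the developer turned it off, and a warning is shown only on first use. The environment variable name and the function are hypothetical; they are not taken from DEP 8.

```python
# Hypothetical variable name, used for illustration only; not from DEP 8.
OPT_OUT_VARIABLE = "EXAMPLE_DJANGO_NO_ANALYTICS"


def reporting_decision(environ, already_warned):
    """Return a (send_report, show_warning) pair for one developer command."""
    if environ.get(OPT_OUT_VARIABLE):
        # The developer opted out: send nothing and stay quiet.
        return (False, False)
    # Opt-out model: reporting is on by default, with a warning on first use.
    return (True, not already_warned)
```

An opt-in system would invert the default: nothing is sent until the developer explicitly agrees, which is what the proposal argues leads to undercounting.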
Information gathered
The reporting is specifically designed to run only
in a developer's environment and not in production. The metrics
identified are, at the time of writing:
- an event category (the developer commands: startproject, startapp, runserver)
- the HTTP User-Agent string identifying the Django, Python, and OS
versions
- a user-specific unique identifier (a UUID generated on first run)
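As an illustration of the last item, a "UUID generated on first run" could be stored in a small state file and reused afterward. This is a sketch under my own assumptions: the function name and the file layout are hypothetical, and the proposal does not specify where the identifier would live.

```python
import uuid
from pathlib import Path


def load_or_create_client_id(state_dir):
    """Return the stored identifier, creating a random one on first run."""
    path = Path(state_dir) / "client_id"  # hypothetical file name
    if path.exists():
        return path.read_text().strip()
    # uuid4() is random, so it is not derived from the user or the machine.
    client_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(client_id)
    return client_id
```

The identifier is random, but it is also stable: every later report from the same developer account carries the same value, which is what makes counting "unique developers" possible.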
The proposal mentions the use of the GA aip flag which,
according to GA documentation, makes "the IP address of the sender
'anonymized'". It is not quite clear how that is done at Google and,
given that it is a proprietary platform, there is no way to verify
that claim. The proposal says it means that "we can't see, and Google
Analytics doesn't store, your actual IP". But that is not actually
what Google does: GA stores IP addresses; the documentation just says
they are anonymized, without explaining how.
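For readers unfamiliar with how such a report looks on the wire, the following sketch assembles the three metrics and the aip flag into a form-encoded body. The parameter names (v, tid, cid, t, ec, ea, aip, ua) are those of the GA Measurement Protocol as I understand it; the function, the fixed "run" action, and the example values are my own assumptions, not code from the proposal. Nothing is sent over the network here.

```python
from urllib.parse import urlencode


def build_event_payload(tracking_id, client_id, command, user_agent):
    """Build a form-encoded event hit; this only constructs the string."""
    params = {
        "v": "1",            # protocol version
        "tid": tracking_id,  # GA property ID (placeholder in this sketch)
        "cid": client_id,    # the UUID generated on first run
        "t": "event",        # hit type
        "ec": command,       # event category: the developer command
        "ea": "run",         # event action: assumed label, not from DEP 8
        "aip": "1",          # ask GA to anonymize the sender's IP address
        "ua": user_agent,    # Django, Python, and OS versions
    }
    return urlencode(params)


body = build_event_payload(
    "UA-XXXXXXX-1",                          # placeholder tracking ID
    "123e4567-e89b-12d3-a456-426614174000",  # example UUID
    "runserver",
    "Django/1.10 Python/3.5 Linux",          # hypothetical format
)
```

Note that aip is only a request to Google: the client still connects from its real IP address, which is why the claim cannot be verified from the outside.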
GA is presented as a trade-off, since "Google's track record indicates
that they don't value privacy nearly as high" as the DSF does. The
alternative, deploying its own analytics software, was presented as
making sustainability problems worse. According to the proposal, Google
"can't track Django users. [...] The only thing Google could do would be
to lie about anonymizing IP addresses, and attempt to match users based
on their IPs".
The truth is that we don't actually know what Google means when it
"anonymizes" data: Jannis Leidel, a Django team member, commented
that "Google has previously been subjected to secret US court orders
and was required to collaborate in mass surveillance conducted by US
intelligence services", which limits even Google's capacity to ensure
its users' anonymity. Leidel also argued that the legal framework of
the US may not apply elsewhere in the world: "for example the strict
German (and by extension EU) privacy laws would exclude the automatic
opt-in as a lawful option".
Furthermore, the proposal claims that "if we discovered Google was
lying about this, we'd obviously stop using them immediately", but it
is unclear exactly how this could be implemented if the software was
already deployed. There are also concerns that an
implementation could block normal operation, especially in countries
(like China) where Google itself may be blocked. Finally, some
expressed concerns that the information could constitute a
security problem, since it would unduly expose the version number of
Django that is running.
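The blocking concern is usually addressed by reporting in the background with a short timeout and ignoring every failure. The sketch below shows that general pattern; it is not taken from the proposal, and the send callable is a hypothetical stand-in for whatever would actually perform the HTTP request.

```python
import threading


def report_in_background(send, payload, timeout=1.0):
    """Run send(payload, timeout) in a daemon thread, ignoring any failure."""
    def worker():
        try:
            send(payload, timeout)
        except Exception:
            # A blocked or unreachable server must never break the command.
            pass

    # A daemon thread does not keep the process alive when the command exits.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

With this shape, a developer in a network where the analytics host is unreachable would at worst lose the report, not the command.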
In other projects
Django is certainly not the first project to consider implementing
analytics to get more information about its users. The proposal is
largely inspired by a similar system implemented by the
OS X Homebrew package manager, which has its own
opt-out analytics.
Other projects embed GA code directly in their web pages. This is
apparently the option chosen by the Oscar Django-based ecommerce
solution, but the DSF saw that as less useful, since it would
count Django administrators rather than
developers. Wagtail, a Django-based content-management
system, was incorrectly identified as using GA directly, as well.
It actually uses referrer information to identify installed domains
through its version-update checks, with opt-out. Wagtail didn't use
GA because the project wanted only minimal data and it was worried
about users' reactions.
NPM, the JavaScript package manager, also considered similar
tracking extensions. Laurie Voss, the co-founder of NPM, said it
decided to completely avoid phoning home, because "users would
absolutely hate it". But NPM users are constantly downloading packages
to rebuild applications from scratch, so it has more complete usage metrics, which are
aggregated and available via a public API. NPM users seem to find this
is a "reasonable utility/privacy trade". Some NPM packages do phone
home and have seen "very mixed" feedback from users, Voss said.
Eric Holscher, co-founder of Read the Docs, said the project
is considering using Sentry for centralized reporting, which is a
different idea, but interesting considering Sentry is fully open
source. So even though it is a commercial service, it may be possible
to verify any anonymity claims, which is not the case with the
closed-source Google Analytics.
Debian's response
Since Django is shipped with Debian, one concern was the reaction of
the distribution to the change. Indeed, "major distros' positions
would be very important for public reception" to the feature,
another developer stated.
One of the current maintainers of Django in Debian, Raphaël Hertzog,
explicitly stated from the start that such a system would "likely
be disabled by default in Debian". There were two short
discussions on Debian mailing lists where the overall consensus
seemed to be that any opt-out tracking code was undesirable in
Debian, especially if it was aimed at Google servers.
I have done some research to see what, exactly, was acceptable as a
phone-home system in the Debian community. My research has revealed
ten distinct bug reports against packages that would unexpectedly
connect to the network, most of which were not directly about
collecting statistics but more often about checking for new
versions. In most cases I found, the feature was disabled. In the case
of version checks, it seems right for Debian to disable the feature,
because the package cannot upgrade itself: that task is delegated to
the package manager. One of those issues was the infamous "OK Google"
voice activation binary blob controversy that was
previously reported here and has since then been fixed (although
other issues remain in Chromium).
I have also found out that there is no clearly defined policy in
Debian regarding tracking software. What I have found, however, is
that there seems to be a strong consensus in Debian that any
tracking is unacceptable. This is, for example, an extract of a policy
that was drafted (but never formally adopted) by Ian Jackson, a
longtime Debian developer:
"Software in Debian should not communicate over the network except:
in order to, and as necessary to, perform their function[...]; or
for other purposes with explicit permission from the user."
In other words, opt-in only, period.
Jackson explained that "when we originally wrote the core of the
policy documents, the DFSG [Debian Free Software Guidelines], the SC
[Social Contract], and so on, no-one would have considered this
behaviour acceptable", which explains why no explicit formal policy
has been adopted yet in the Debian project.
One of the concerns with opt-out systems (or even prompts that default
to opt-in) was well explained back then by Debian developer Bas
Wijnen:
"It very much resembles having to click through a license for every
package you install. One of the nice things about Debian is that the
user doesn't need to worry about such things: Debian makes sure
things are fine."
One could argue that Debian has its own tracking systems. For example,
in its default configuration, Debian "phones home" through the APT update
system whenever packages are updated (though it only reports the packages
requested). However, updates are not currently automated by default,
although there are plans to do so soon. Furthermore, Debian members do not
consider APT to be tracking, because it needs to connect to the network
to accomplish its primary function. Since there are multiple
distributed mirrors (which the user gets to choose when installing),
the risk of surveillance and tracking is also greatly reduced.
A better parallel could be drawn with Debian's popcon system,
which actually tracks Debian installations, including package
lists. But as Barry Warsaw pointed out in that discussion, "popcon
is 'opt-in' and [...] the overwhelming majority in Debian is in favour
of it in contrast to 'opt-out'". It should be noted that popcon, while
opt-in, defaults to "yes" if users click
through the install process. [Update: As pointed out in the
comments, popcon actually defaults to "no" in Debian.] There are around 200,000 submissions at this time, which are
tracked with machine-specific unique identifiers that are submitted
daily. Ubuntu, which also uses the popcon software, gets around
2.8 million daily submissions, while Canonical estimates there are
40 million desktop users of Ubuntu. This would mean there is about
an order of magnitude more installations than what is reported by
popcon.
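The "order of magnitude" estimate follows directly from the two Ubuntu figures quoted above; this short calculation uses only those numbers.

```python
ubuntu_popcon_submissions = 2_800_000  # daily popcon submissions
ubuntu_desktop_users = 40_000_000      # Canonical's estimate

# How many installations exist for each one that reports to popcon.
ratio = ubuntu_desktop_users / ubuntu_popcon_submissions

# The share of installations that popcon actually sees.
reporting_share = ubuntu_popcon_submissions / ubuntu_desktop_users

print(f"{ratio:.1f} installations per submission, {reporting_share:.0%} reporting")
```

That is roughly 14 installations for every one that reports, or about 7% coverage.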
Policy aside, Warsaw explained that "Debian has a reputation for taking
privacy issues very serious and likes to keep it".
Next steps
There are obviously disagreements within the Django project about how
to handle this problem. It looks like the phone-home system may end up
being implemented as a proxy system "which would allow us to strip IP
addresses instead of relying on Google to anonymize them, or to
anonymize them ourselves", another Django developer, Aymeric
Augustin, said.
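Augustin did not say how a proxy would anonymize addresses. One common approach is to truncate the address to its network prefix before storing or forwarding it, which can be sketched with Python's standard ipaddress module. The prefix lengths below (/24 for IPv4, /48 for IPv6) are my own assumption, chosen for illustration only.

```python
import ipaddress


def anonymize_ip(address, v4_prefix=24, v6_prefix=48):
    """Zero the host part of an address, keeping only the network prefix."""
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False accepts an address with host bits set and masks them off.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)
```

Under these assumptions, anonymize_ip("203.0.113.77") yields "203.0.113.0". The point of running such a step in a proxy controlled by the project is that the truncation can be audited, unlike whatever happens on Google's side.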
Augustin also stated that the feature wouldn't "land
before Django drops support for Python 2", which is currently
estimated to be around 2020. It is unclear, then, how the proposal
would resolve the funding issues, considering how long it would take
to deploy the change and then collect the information so that it can
be used to spur the funding efforts.
It also seems the system may explicitly prompt the user, with an
opt-out default, instead of just splashing a warning or privacy
agreement without a prompt. As Shai Berger, another Django
contributor, stated, "you do not get [those] kind of numbers in
community surveys". Berger also made the argument that "we trust the
community to give back without being forced to do so"; furthermore:
"I don't believe the increase we might get in the number of reports
by making it harder to opt-out, can be worth the ill-will generated
for people who might feel the reporting was 'sneaked' upon them, or
even those who feel they were nagged into participation rather than
choosing to participate."
Other options may also include gathering metrics in pip or PyPI,
which was proposed by Donald Stufft. Leidel also proposed
that the system could ask users to opt in only after the
commands have been called a few times.
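Leidel's delayed prompt could be as simple as counting invocations and asking once a threshold is reached. In this sketch the function name and the threshold of three runs are assumptions; the discussion did not settle on a number.

```python
def should_ask_opt_in(run_count, already_answered, threshold=3):
    """Ask for consent only after a few runs, and only if not yet answered."""
    return (not already_answered) and run_count >= threshold
```

Nothing would be reported before the developer answers, so this remains an opt-in design; the delay only avoids interrupting someone who is trying Django for the first time.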
It is encouraging to see that a community can discuss such issues
without heating up too much; it shows great maturity in the Django
project. Every free-software project may be confronted with funding and
sustainability issues. Django seems to be trying to address this in a
transparent way. The project is willing to engage with the whole
spectrum of the community, from the top leaders to downstream
distributors, including individual developers. This practice should
serve as a model, if not of how to do funding or tracking, at least of
how to discuss those issues productively.
Everyone seems to agree the point is not to surveil users, but to improve
the software. As Lars Wirzenius, a Debian developer,
commented: "it's a very sad situation if free
software projects have to compromise on privacy to get
funded". Hopefully, Django will be able to improve its funding without
compromising its principles.
Note: this article first appeared in the Linux Weekly News.