Search Results: "daniels"

8 December 2020

Russell Coker: Links December 2020

Business Insider has an informative article about the way that Google users can get locked out with no apparent reason and no recourse [1]. Something to share with clients when they consider putting everything in the cloud. Vice has an interesting article about people jailbreaking used Teslas after Tesla has stolen software licenses that were bought with the car [2]. The Atlantic has an interesting article titled This Article Won't Change Your Mind [3]. It's one of many on the topic of echo chambers but has some interesting points that others don't seem to cover, such as regarding the benefits of groups when not everyone agrees. Inequality.org has lots of useful information about global inequality [4]. Jeffrey Goldberg has an insightful interview with Barack Obama for the Atlantic about the future course of American politics and a retrospective on his term in office [5]. A Game Designer's Analysis Of QAnon is an insightful Medium article comparing QAnon to an augmented reality game [6]. This is one of the best analyses of QAnon operations that I've seen. Decrypting Rita is one of the most interesting web comics I've read [7]. It makes good use of side scrolling and different layers to tell multiple stories at once. PC Mag has an article about the new features in Chrome 87 to reduce CPU use [8]. On my laptop I have 1/3 of all CPU time being used when it is idle, the majority of which is from Chrome. As the CPU has 2 cores this means the equivalent of 1 core running about 66% of the time just for background tabs. I have over 100 tabs open, which I admit is a lot. But it means that the active tabs (as opposed to the plain HTML or PDF ones) are averaging more than 1% CPU time on an i7, which seems obviously unreasonable. So Chrome 87 doesn't seem to live up to Google's claims. The movie Bad President, starring Stormy Daniels as herself, is out [9]. Poe's Law is passe. Interesting summary of Parler; it seems that it was designed by the Russians [10]. Wired has an interesting article about Indistinguishability Obfuscation, how to encrypt the operation of a program [11]. Joerg Jaspert wrote an interesting blog post about the difficulties of packaging Rust and Go for Debian [12]. I think that the problem is that many modern languages aren't designed well for library updates. This isn't just a problem for Debian, it's a problem for any long term support of software that doesn't involve transferring a complete archive of everything, and it's a problem for any disconnected development (remote sites and sites dealing with serious security). Having an automatic system for downloading libraries is fine. But there should be an easy way of getting the same source via an archive format (zip will do, as any archive can be converted to any other easily enough) and with version numbers.

12 October 2017

Joachim Breitner: Isabelle functions: Always total, sometimes undefined

Often, when I mention how things work in the interactive theorem prover Isabelle/HOL (in the following just Isabelle1) to people with a strong background in functional programming (whether that means Haskell or Coq or something else), I cause confusion, especially around the issues of what a function is, whether functions are total, and what the business with undefined is. In this blog post, I want to explain some of these issues, aimed at functional programmers or type theoreticians. Note that this is not meant to be a tutorial; I will not explain how to do these things, and will focus on what they mean.

HOL is a logic of total functions If I have an Isabelle function f :: a ⇒ b between two types a and b (the function arrow in Isabelle is ⇒, not →), then, by definition of what it means to be a function in HOL, whenever I have a value x :: a, the expression f x (i.e. f applied to x) is a value of type b. Therefore, and without exception, every Isabelle function is total. In particular, it cannot be that f x does not exist for some x :: a. This is a first difference from Haskell, which does have partial functions like
spin :: Maybe Integer -> Bool
spin (Just n) = spin (Just (n+1))
Here, neither the expression spin Nothing nor the expression spin (Just 42) produce a value of type Bool: The former raises an exception ("incomplete pattern match"), the latter does not terminate. Confusingly, though, both expressions have type Bool. Because every function is total, this confusion cannot arise in Isabelle: If an expression e has type t, then it is a value of type t. This trait is shared with other total systems, including Coq. Did you notice the emphasis I put on the word "is" here, and how I deliberately did not write "evaluates to" or "returns"? This is because of another big source of confusion:

Isabelle functions do not compute We (i.e., functional programmers) stole the word "function" from mathematics and repurposed it2. But the word "function", in the context of Isabelle, refers to the mathematical concept of a function, and it helps to keep that in mind. What is the difference?
  • A function a → b in functional programming is an algorithm that, given a value of type a, calculates (returns, evaluates to) a value of type b.
  • A function a ⇒ b in math (or Isabelle) associates with each value of type a a value of type b.
For example, the following is a perfectly valid function definition in math (and HOL), but could not be a function in the programming sense:
definition foo :: "(nat   real)   real" where
  "foo seq = (if convergent seq then lim seq else 0)"
This assigns a real number to every sequence, but it does not compute it in any useful sense. From this it follows that

Isabelle functions are specified, not defined Consider this function definition:
fun plus :: "nat   nat   nat"  where
   "plus 0       m = m"
   "plus (Suc n) m = Suc (plus n m)"
To a functional programmer, this reads
plus is a function that analyses its first argument. If that is 0, then it returns the second argument. Otherwise, it calls itself with the predecessor of the first argument and increases the result by one.
which is clearly a description of a computation. But to Isabelle, the above reads
plus is a binary function on natural numbers, and it satisfies the following two equations:
And in fact, it is not so much Isabelle that reads it this way, but rather the fun command, which is external to the Isabelle logic. The fun command analyses the given equations, constructs a non-recursive definition of plus under the hood, passes that to Isabelle and then proves that the given equations hold for plus. One interesting consequence of this is that different specifications can lead to the same functions. In fact, if we were to define plus' by recursing on the second argument, we'd obtain the same function (i.e. plus = plus' is a theorem, and there would be no way of telling the two apart).

Termination is a property of specifications, not functions Because a function does not evaluate, it does not make sense to ask if it terminates. The question of termination arises before the function is defined: The fun command can only construct plus in a way that the equations hold if it passes a termination check, very much like Fixpoint in Coq. But while the termination check of Fixpoint in Coq is a deep part of the basic logic, in Isabelle it is simply something that this particular command requires for its internal machinery to go through. At no point does a termination proof of the function exist as a theorem inside the logic. And other commands may have other means of defining a function that do not even require such a termination argument! For example, a function specification that is tail-recursive can be turned into a function even without a termination proof: The following definition describes a higher-order function that iterates its first argument f on the second argument x until it finds a fixpoint. It is completely polymorphic (the single quote in 'a indicates that this is a type variable):
partial_function (tailrec)
  fixpoint :: "('a   'a)   'a   'a"
where
  "fixpoint f x = (if f x = x then x else fixpoint f (f x))"
We can work with this definition just fine. For example, if we instantiate f with (λx. x - 1), we can prove that it will always return 0:
lemma "fixpoint (  n . n - 1) (n::nat) = 0"
  by (induction n) (auto simp add: fixpoint.simps)
Similarly, if we have a function that works within the option monad (i.e. Maybe in Haskell), its specification can always be turned into a function without an explicit termination proof; here is one that calculates the Collatz sequence:
partial_function (option) collatz :: "nat ⇒ nat list option"
 where "collatz n =
        (if n = 1 then Some [n]
         else if even n
           then do { ns <- collatz (n div 2);    Some (n # ns) }
           else do { ns <- collatz (3 * n + 1);  Some (n # ns) })"
Note that lists in Isabelle are finite (like in Coq, unlike in Haskell), so this function returns a list only if the Collatz sequence eventually reaches 1. I expect these definitions to make a Coq user very uneasy. How can fixpoint be a total function? What is fixpoint (λn. n + 1)? What if we run collatz n for an n where the Collatz sequence does not reach 1?3 We will come back to that question after a little detour.

HOL is a logic of non-empty types Another big difference between Isabelle and Coq is that in Isabelle, every type is inhabited. Just like the totality of functions, this is a very fundamental fact about what HOL defines to be a type. Isabelle gets away with that design because in Isabelle, we do not use types for propositions (like we do in Coq), so we do not need empty types to denote false propositions. This design has an important consequence: It allows the existence of a polymorphic expression that inhabits any type, namely
undefined :: 'a
The naming of this term alone has caused a great deal of confusion for Isabelle beginners, or in communication with users of different systems, so I implore you not to read too much into the name. In fact, you will have a better time if you think of it as "arbitrary" or, even better, "unknown". Since undefined can be instantiated at any type, we can instantiate it for example at bool, and we can observe an important fact: undefined is not an extra value besides the "usual" ones. It is simply some value of that type, as demonstrated in the following lemma:
lemma "undefined = True   undefined = False" by auto
In fact, if the type has only one value (such as the unit type), then we know the value of undefined for sure:
lemma "undefined = ()" by auto
It is very handy to be able to produce an expression of any type, as we will see in the following.

Partial functions are just underspecified functions For example, it allows us to translate incomplete function specifications. Consider this definition, Isabelle's equivalent of Haskell's partial fromJust function:
fun fromSome :: "'a option ⇒ 'a" where
  "fromSome (Some x) = x"
This definition is accepted by fun (albeit with a warning), and the generated function fromSome behaves exactly as specified: when applied to Some x, it is x. The term fromSome None is also a value of type 'a; we just do not know which one it is, as the specification does not address that. So fromSome None behaves just like undefined above, i.e. we can prove
lemma "fromSome None = False   fromSome None = True" by auto
Here is a small exercise for you: Can you come up with an explanation for the following lemma:
fun constOrId :: "bool ⇒ bool" where
  "constOrId True = True"
lemma "constOrId = ( _.True)   constOrId = ( x. x)"
  by (metis (full_types) constOrId.simps)
Overall, this behavior makes sense if we remember that function definitions in Isabelle are not really definitions, but rather specifications. And a partial function definition is simply an underspecification. The resulting function is simply any function that fulfills the specification, and the two lemmas above underline that observation.

Nonterminating functions are also just underspecified Let us return to the puzzle posed by fixpoint above. Clearly, the function seen as a functional program is not total: When passed the argument (λn. n + 1) or (λb. ¬ b) it will loop forever trying to find a fixed point. But Isabelle functions are not functional programs, and the definitions are just specifications. What does the specification say about the case when f has no fixed point? It states that the equation fixpoint f x = fixpoint f (f x) holds. And this equation has a solution, for example fixpoint f _ = undefined. Or more concretely: The specification of the fixpoint function states that fixpoint (λb. ¬ b) True = fixpoint (λb. ¬ b) False has to hold, but it does not specify which particular value (True or False) it should denote; any is fine.

Not all function specifications are ok At this point you might wonder: Can I just specify any equations for a function f and get a function out of that? But rest assured: That is not the case. For example, no Isabelle command allows you to define a function bogus :: () ⇒ nat with the equation bogus () = Suc (bogus ()), because this equation does not have a solution. We can actually prove that such a function cannot exist:
lemma no_bogus: "∄ bogus. bogus () = Suc (bogus ())" by simp
(Of course, not_bogus () = not_bogus () is just fine.)

You cannot reason about partiality in Isabelle We have seen that there are many ways to define functions that one might consider "partial". Given a function, can we prove that it is not partial in that sense? Unfortunately, but unavoidably, no: Since undefined is not a separate, recognizable value, but rather simply an unknown one, there is no way of stating that "a function result is not specified". Here is an example that demonstrates this: Two partial functions (one with not all cases specified, the other one with a self-referential specification) are indistinguishable from the total variant:
fun partial1 :: "bool ⇒ unit" where
  "partial1 True = ()"
partial_function (tailrec) partial2 :: "bool ⇒ unit" where
  "partial2 b = partial2 b"
fun total :: "bool ⇒ unit" where
  "total True = ()"
  "total False = ()"
lemma "partial1 = total   partial2 = total" by auto
If you really do want to reason about partiality of functional programs in Isabelle, you should consider implementing them not as plain HOL functions, but rather use HOLCF, where you can give equational specifications of functional programs and obtain continuous functions between domains. In that setting, ⊥ ≠ () and partial2 ≠ total. We have done that to verify some of HLint's equations.

You can still compute with Isabelle functions I hope that by this point I have not scared away anyone who wants to use Isabelle for functional programming, and in fact, you can use it for that. If the equations that you pass to fun are a reasonable definition for a function (in the programming sense), then these equations, used as rewriting rules, will allow you to compute that function quite like you would in Coq or Haskell. Moreover, Isabelle supports code extraction: You can take the equations of your Isabelle functions and have them exported into OCaml, Haskell, Scala or Standard ML. See CoCon for a conference management system with confidentiality verified in Isabelle. While these usually are the equations you defined the function with, they don't have to be: You can declare other proved equations to be used for code extraction, e.g. to refine your elegant definitions to performant ones. Like with code extraction from Coq to, say, Haskell, the adequacy of the translation rests on a moral reasoning foundation. Unlike extraction from Coq, where you have an (unformalized) guarantee that the resulting Haskell code is terminating, you do not get that guarantee from Isabelle. Conversely, this allows you to reason about and extract non-terminating programs, like fixpoint, which is not possible in Coq. There is currently ongoing work on verified code generation, where the code equations are reflected into a deep embedding of HOL in Isabelle that would allow explicit termination proofs.

Conclusion We have seen how in Isabelle, every function is total. Function declarations have equations, but these do not define the function in a computational sense, but rather specify it. Because in HOL there are no empty types, many specifications that appear partial (incomplete patterns, non-terminating recursion) have solutions in the space of total functions. Partiality in the specification is no longer visible in the final product.

PS: Axiom undefined in Coq This section is speculative, and an invitation for discussion. Coq already distinguishes between types used in programs (Set) and types used in proofs (Prop). Could Coq ensure that every t : Set is non-empty? I imagine this would require additional checks in the Inductive command, similar to the checks that the Isabelle command datatype has to perform4, and it would disallow Empty_set. If so, then it would be sound to add the following axiom
Axiom undefined : forall (a : Set), a.
wouldn't it? This axiom does not have any computational meaning, but that seems to be ok for optional Coq axioms, like classical reasoning or function extensionality. With this in place, how much of what I describe above about function definitions in Isabelle could now be done soundly in Coq? Certainly pattern matches would not have to be complete and could sport an implicit case _ => undefined. Would it help with non-obviously terminating functions? Would it allow a Coq command Tailrecursive that accepts any tail-recursive function without a termination check?

  1. Isabelle is a metalogical framework, and other logics, e.g. Isabelle/ZF, behave differently. For the purpose of this blog post, I always mean Isabelle/HOL.
  2. Isabelle is a metalogical framework, and other logics, e.g. Isabelle/ZF, behave differently. For the purpose of this blog post, I always mean Isabelle/HOL.
  3. Let me know if you find such an n. Besides n = 0.
  4. Like fun, the constructions by datatype are not part of the logic, but create a type definition from more primitive notions that is isomorphic to the specified data type.

18 September 2017

Russ Allbery: Consolidation haul

My parents are less fond than I am of filling every available wall in their house with bookshelves and did a pruning of their books. A lot of them duplicated other things that I had, or didn't sound interesting, but I still ended up with two boxes of books (and now have to decide which of my books to prune, since I'm out of shelf space). Also included is the regular accumulation of new ebook purchases.
Mitch Albom Tuesdays with Morrie (nonfiction)
Ilona Andrews Clean Sweep (sff)
Catherine Asaro Charmed Sphere (sff)
Isaac Asimov The Caves of Steel (sff)
Isaac Asimov The Naked Sun (sff)
Marie Brennan Dice Tales (nonfiction)
Captain Eric "Winkle" Brown Wings on My Sleeve (nonfiction)
Brian Christian & Tom Griffiths Algorithms to Live By (nonfiction)
Tom Clancy The Cardinal of the Kremlin (thriller)
Tom Clancy The Hunt for the Red October (thriller)
Tom Clancy Red Storm Rising (thriller)
April Daniels Sovereign (sff)
Tom Flynn Galactic Rapture (sff)
Neil Gaiman American Gods (sff)
Gary J. Hudson They Had to Go Out (nonfiction)
Catherine Ryan Hyde Pay It Forward (mainstream)
John Irving A Prayer for Owen Meany (mainstream)
John Irving The Cider House Rules (mainstream)
John Irving The Hotel New Hampshire (mainstream)
Lawrence M. Krauss Beyond Star Trek (nonfiction)
Lawrence M. Krauss The Physics of Star Trek (nonfiction)
Ursula K. Le Guin Four Ways to Forgiveness (sff collection)
Ursula K. Le Guin Words Are My Matter (nonfiction)
Richard Matheson Somewhere in Time (sff)
Larry Niven Limits (sff collection)
Larry Niven The Long ARM of Gil Hamilton (sff collection)
Larry Niven The Magic Goes Away (sff)
Larry Niven Protector (sff)
Larry Niven World of Ptavvs (sff)
Larry Niven & Jerry Pournelle The Gripping Hand (sff)
Larry Niven & Jerry Pournelle Inferno (sff)
Larry Niven & Jerry Pournelle The Mote in God's Eye (sff)
Flann O'Brien The Best of Myles (nonfiction)
Jerry Pournelle Exiles to Glory (sff)
Jerry Pournelle The Mercenary (sff)
Jerry Pournelle Prince of Mercenaries (sff)
Jerry Pournelle West of Honor (sff)
Jerry Pournelle (ed.) Codominium: Revolt on War World (sff anthology)
Jerry Pournelle & S.M. Stirling Go Tell the Spartans (sff)
J.D. Salinger The Catcher in the Rye (mainstream)
Jessica Amanda Salmonson The Swordswoman (sff)
Stanley Schmidt Aliens and Alien Societies (nonfiction)
Cecilia Tan (ed.) Sextopia (sff anthology)
Lavie Tidhar Central Station (sff)
Catherynne Valente Chicks Dig Gaming (nonfiction)
J.E. Zimmerman Dictionary of Classical Mythology (nonfiction)
This is an interesting tour of a lot of stuff I read as a teenager (Asimov, Niven, Clancy, and Pournelle, mostly in combination with Niven but sometimes his solo work). I suspect I will no longer consider many of these books to be very good, and some of them will probably go back into used bookstores after I've re-read them for memory's sake, or when I run low on space again. But all those mass market SF novels were a big part of my teenage years, and a few (like Mote In God's Eye) I definitely want to read again. Also included is a random collection of stuff my parents picked up over the years. I don't know what to expect from a lot of it, which makes it fun to anticipate. Fall vacation is coming up, and with it a large amount of uninterrupted reading time.

8 September 2017

Vincent Fourmond: Extract many attachments from many mails in one go using ripmime

I was recently looking for a way to extract many attachments from a series of emails. I first had a look at the AttachmentExtractor thunderbird plugin, but it seems very old and not maintained anymore. So I've come up with another very simple solution that also works with any other mail client. Just copy all the mails you want to extract attachments from to a single (temporary) mail folder, find out which file holds the mail folder and use ripmime on that file (ripmime is packaged for Debian). For my case, it looked like:
~ ripmime -i .icedove/XXXXXXX.default/Mail/pop.xxxx/tmp -d target-directory
Simple solution, but it saved me quite some time. Hope it helps!

27 May 2017

Russ Allbery: Optimistic haul

I never have as much time to read as I wish I did, but I keep buying books, of course. Maybe someday I'll have a good opportunity to take extended time off work and just read for a bit. Well, retirement, at least, right?
Charlie Jane Anders All the Birds in the Sky (sff)
Peter C. Brown, et al. Make It Stick (nonfiction)
April Daniels Dreadnought: Nemesis (sff)
T. Kingfisher The Halcyon Fairy Book (sff collection)
T. Kingfisher Jackalope Wives and Other Stories (sff collection)
Margot Lee Shetterly Hidden Figures (nonfiction)
Cordwainer Smith Norstrilia (sff)
Kristine Smith Code of Conduct (sff)
Jonathan Taplin Move Fast and Break Things (nonfiction)
Sarah Zettel Fool's War (sff)
Sarah Zettel Playing God (sff)
Sarah Zettel The Quiet Invasion (sff)
It doesn't help that James Nicoll keeps creating new lists of books that all sound great. And there's some really interesting nonfiction being written right now. Make It Stick is the current book for the work book club.

7 March 2017

Daniel Stender: Remotely deploy a WSGI application (as a Debian package) with Ansible

This is a mini workshop as an introduction to using Ansible for the administration of Debian systems. As an example, it shows how this configuration management tool can be used to remotely set up a simple WSGI application running on an Apache web server on a Debian installation, to make it available on the net. The application which is used as an example is httpbin by Runscope. This is a useful HTTP request service for the development of web software or any other purposes, which features a number of specific endpoints that can be used for different testing matters. For example, the address http://<address>/user-agent of httpbin returns the user agent identification of the client program which has been used to query it (that's taken from the header of the request). There are official instances of this request server running on the net, like the one at http://httpbin.org/. WSGI is a widespread standard for programming web applications in Python, and httpbin is implemented in Python using the Flask web framework.
The basis of the workshop is a simple base installation of an up-to-date Debian 8 "Jessie" on a demonstration host; the latest official release of that is 8.7. As a first step, the installation has to be switched over to the testing branch of Debian, because the Debian packages of httpbin are comparatively new and will be introduced into the stable branch of the archive for the first time with the upcoming major release 9 "Stretch". After that, the Apache packages which are needed to make it available (apache2 and libapache2-mod-wsgi; other web servers could of course be used instead), and which are not part of a base installation, are installed from the archive. The web server then gets launched remotely, the httpbin package is also pulled, and the service is integrated into Apache to run on it. To achieve that, two configuration files must be deployed on the target system, and a few additional operations are needed to get everything working together. Every step is preconfigured within Ansible so that the whole process can be launched by a single command on the control node, and can be run on a single or on a number of comparable target machines automatically and reproducibly.
If a server is needed for trying this workshop out, straightforward cloud server instances are available on the net, for example at DigitalOcean, but let me underline this: there are other cloud providers which offer the same things, too! If it's needed for experiments or other purposes only for a limited time, low priced droplets are available here which are billed by the hour. After being registered, the machine(s) which is/are wanted can be set up easily over the web interface (choose Debian 8.7 as OS), but there are also command line clients available like doctl (which is not yet available as a Debian package). For the convenient use of a droplet the user should generate an SSH key pair on the local machine first:
$ ssh-keygen -t rsa -b 4096 -C "john@doe.com" -f ~/.ssh/mykey
The public part of the key ~/.ssh/mykey.pub can then be uploaded into the user account before the droplet is created; it can then be integrated automatically. There is a good introduction to the whole process available in the excellent tutorial series serversforhackers.com, here. Ansible can then use the SSH keypair to log into a droplet without the need to type in the password every time. On a cloud server like this carrying a Debian base system, the examples in this workshop can be tried out well. Ansible works client-less and doesn't need to be installed on the remote system but only on the control node; however, a Python 2.7 interpreter is needed there (the base system of DigitalOcean includes that). Before Ansible can do anything on them, the remote servers which are going to be controlled must be added to /etc/ansible/hosts. This is a configuration file in the INI format for DNS names and IP addresses. For a flexible organisation of the server inventory it's possible to group hosts here, IP ranges could be given, and optional variables can be used, among other useful things (the default file contains a couple of examples). One or a couple of servers (in Ansible they are called "hosts") on which something particular is going to happen (like httpbin being installed) could be added like this (the group name is arbitrary):
[httpbin]
192.0.2.0
Whether Ansible can communicate with the hosts in the group and actually operate on them can be verified by simply pinging them like this:
$ ansible httpbin -m ping -u root --private-key=~/.ssh/mykey
192.0.2.0 | SUCCESS => {
    "changed": false, 
    "ping": "pong"
}
The command succeeded, so it appears there isn't any significant problem with this machine. The return value changed: false indicates that there haven't been any changes on that host as a result of running this command. Besides ping there are several other modules which can be used with the command line tool ansible in the same way, and these modules are effectively the core components of Ansible. The module shell, for example, can be used to execute shell commands on the remote machine, like uname to get some system information returned from the server:
$ ansible httpbin -m shell -a "uname -a" -u root --private-key=~/.ssh/mykey
192.0.2.0 | SUCCESS | rc=0 >>
Linux debian-512mb-fra1-01 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
In the same way, the module apt could be used to remotely install packages. But with that there's no major advantage over other software products that offer similar functionality, and using those modules on the command line is just the basics of Ansible usage. Playbooks in Ansible are YAML scripts for the manipulation of the registered hosts in /etc/ansible/hosts. Different tasks can be defined here for successive processing; for example, a simple playbook for changing the package source from stable to testing goes like this:
---
 - hosts: httpbin
   tasks:
   - name: remove "jessie" package source
     apt_repository: repo='deb http://mirrors.digitalocean.com/debian jessie main' state=absent
   - name: add "testing" package source
     apt_repository: repo='deb http://httpredir.debian.org/debian testing main contrib non-free' state=present
   - name: upgrade packages
     apt: update_cache=yes upgrade=dist
First, as with the CLI tool ansible used above, the targeted host group httpbin is chosen. The default user root and the SSH key could be fixed here, too, to spare the need to give them on the command line. Then there are three tasks defined to be worked through consecutively: With the module apt_repository the preset package source jessie is removed from /etc/apt/sources.list. Then, a new package source for the testing archive gets added to /etc/apt/sources.list.d/ by using the same module (by the way, mirrors.digitalocean.com also provides testing, and that might be faster). After that, the apt module is used to upgrade the package inventory (it performs apt-get dist-upgrade) after an update of the package cache has taken place (by running apt-get update). A playbook like this (the filename is arbitrary, but commonly carries the suffix .yml) can be run by the CLI tool ansible-playbook, like this:
$ ansible-playbook httpbin.yml -u root --private-key=~/.ssh/mykey
Ansible then works through the individual plays of the tasks on the remote server(s) top-down, and thanks to a high speed net connection and SSD block device hardware, the change of the system into a Debian testing base installation only takes around a minute to complete in the cloud. While working, Ansible puts out status reports for the individual operations. If certain changes to the base system have already taken place, like when a playbook is run through one more time, the modules of course sense that and just return the information that the system hasn't been changed, because it is already in the state it was supposed to be changed to. Beyond the basic playbook which is shown here there are more advanced features like register and when available, to bind the execution of a play to the error-free result of a previous one. The apt module then can be used in the playbook to install the three needed binary packages one after another:
   - name: install apache2
     apt: pkg=apache2 state=present
   - name: install mod_wsgi
     apt: pkg=libapache2-mod-wsgi state=present
   - name: install httpbin
     apt: pkg=python-httpbin state=present
The Debian packages are configured in a way that the Apache web server is running immediately after installation, and the Apache module mod_wsgi is automatically integrated. If something else is desired, there are Ansible modules available for operating on Apache which can reverse this. By the way, after the packages have been installed the httpbin server can be launched with python -m httpbin.core, but this runs only a mini web server which is not suitable for productive use. To get httpbin running on the Apache web server two configuration files are needed. They can be set up in the project directory on the control node and then uploaded onto the remote machine with another Ansible module. The file httpbin.wsgi (the name is again arbitrary) contains only a single line, which is the starter for the WSGI application to run:
from httpbin import app as application
The module copy can be used to deploy that script on the host, while the target folder /var/www/httpbin must be set up before that by the module file. In addition to that, a separate user account like httpbin (the name is also arbitrary but picked up in the other config file) is needed to run it, and the module user can set this up. The demonstration playbook continues, and the plays which perform these three operations go like this:
   - name: mkdir /var/www/httpbin
     file: path=/var/www/httpbin state=directory
   - name: set up user "httpbin"
     user: name=httpbin
   - name: copy WSGI starter
     copy: src=httpbin.wsgi dest=/var/www/httpbin/httpbin.wsgi owner=httpbin group=httpbin mode=0644 
Another configuration script httpbin.conf is needed for Apache on the remote server to include the WSGI application httpbin running as a virtual host. It goes like this:
<VirtualHost *>
 WSGIDaemonProcess httpbin user=httpbin group=httpbin threads=5
 WSGIScriptAlias / /var/www/httpbin/httpbin.wsgi
 <Directory /var/www/httpbin>
  WSGIProcessGroup httpbin
  WSGIApplicationGroup %{GLOBAL}
  Order allow,deny
  Allow from all
 </Directory>
</VirtualHost>
This file needs to be copied into the folder /etc/apache2/sites-available on the host, which already exists when the apache2 package is installed. The remaining operations which are needed to get everything running together are: The default welcome screen of Apache blocks anything else and should be disabled by Apache's CLI tool a2dissite. And after that, the new virtual host needs to be activated with the complementary tool a2ensite; both can be run remotely by the module command. Then the Apache server on the remote machine must be restarted to read in the new configuration. You've guessed it already, that's all easy to perform with Ansible:
   - name: deploy configuration script
     copy: src=httpbin.conf dest=/etc/apache2/sites-available owner=root group=root mode=0644
   - name: deactivate default welcome screen
     command: a2dissite 000-default.conf
     
   - name: activate httpbin virtual host
     command: a2ensite httpbin.conf
   - name: restart Apache
     service: name=apache2 state=restarted 
That's it. After this playbook has been performed by Ansible completely on one (or several) freshly set up remote Debian base installation(s), the httpbin request server is available running on the Apache web server and can be queried from anywhere by a web browser, or for example by curl:
$ curl http://192.0.2.0/user-agent
{
  "user-agent": "curl/7.50.1"
}
With the broad set of Ansible modules and with playbooks, a lot of tasks can be accomplished, like the example problem which has been explained here. But the range of functions of Ansible is even more comprehensive; to discuss all of that would have blown the frame of this blog post. For example, playbooks offer more advanced features like event handlers, which can be used for recurring operations like the restart of Apache in more extensive projects (a small sketch of such a handler follows below). And beyond playbooks, templates can be set up in roles, which can behave differently on selected machine groups; Ansible uses Jinja2 as its template engine for that. And the scope of functions of the basic modules can be expanded by employing external tools. To drop a word on why it could be useful in certain situations to run your own instances of the httpbin request server instead of using the official ones which are provided on the net by Runscope: some people would prefer to run a private instance, for example in the local network, instead of querying the one on the internet. Or for some development reasons a couple or even a large number of identical instances might be needed; Ansible is ideal for setting them up automatically. Anyway, the Javascript bindings to tracking services like Google Analytics in httpbin/templates/trackingscripts.html are patched out in the Debian package. That could be another reason to prefer to set up your own instance on a Debian server.
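As an illustration of the handler mechanism mentioned above, here is a minimal sketch (not part of the workshop files, just a hypothetical variant of the plays shown earlier) of how the configuration deployment could notify a handler, so that Apache is only restarted when the copied file has actually changed:
---
 - hosts: httpbin
   tasks:
   - name: deploy configuration script
     copy: src=httpbin.conf dest=/etc/apache2/sites-available owner=root group=root mode=0644
     # the notified handler runs at the end of the play, and only if this task reported a change
     notify: restart apache
   handlers:
   - name: restart apache
     service: name=apache2 state=restarted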

15 February 2017

Daniel Stender: APT programming snippets for Debian system maintenance

The Python API for the Debian package manager APT is useful for writing practical system maintenance scripts which go beyond shell scripting capabilities. There are Python 2 and Python 3 libraries for that available as packages, as well as documentation in the package python-apt-doc. If that is also installed, the documentation can be found in /usr/share/doc/python-apt-doc/html/index.html, and there are also a couple of example scripts shipped in /usr/share/doc/python-apt-doc/examples. The libraries mainly consist of Python bindings for the libapt-inst and libapt-pkg C++ core libraries of the APT package manager, which makes the processing very fast. Debugging symbols are also available as packages (python{,3}-apt-dbg). The module apt_inst provides features like reading from binary packages, while apt_pkg resembles the functions of the package manager. There is also the apt abstraction layer which provides more convenient access to the library; for example, apt.cache.Cache() can be used to behave like apt-get:
from apt.cache import Cache
mycache = Cache()
mycache.update()                   # apt-get update
mycache.open()                     # re-open
mycache.upgrade(dist_upgrade=True) # apt-get dist-upgrade
mycache.commit()                   # apply

boil out selections As widely known, there is a feature of dpkg which helps to move a package inventory from one installation to another by just using a text file with a list of installed packages. A selections list containing all installed packages can be easily generated with $ dpkg --get-selections > selections.txt. The resulting file then looks something like this:
$ cat selections.txt
0ad                                 install
0ad-data                            install
0ad-data-common                     install
a2ps                                install
abi-compliance-checker              install
abi-dumper                          install
abigail-tools                       install
accountsservice                     install
acl                                 install
acpi                                install
The counterpart of this operation (--set-selections) can be used to reinstall (add) the complete package inventory on another installation resp. computer (that needs superuser rights), as explained in the manpage dpkg(1). No problem so far. The problem is, if that list contains a package which couldn't be found in any of the package inventories which are set up in /etc/apt/sources.list(.d/) on the target system, dpkg stops the whole process:
# dpkg --set-selections < selections.txt
dpkg: warning: package not in database at line 524: google-chrome-beta
dpkg: warning: found unknown packages; this might mean the available database
is outdated, and needs to be updated through a frontend method
Thus, manually downloaded and installed "wild" packages from unofficial package sources are problematic for this approach, because the package installer simply doesn't know where to get them. Luckily, dpkg puts out the relevant package names, but instead of having them removed manually with an editor, this little Python script for python3-apt automatically deletes any of these packages from a selections file:
#!/usr/bin/env python3
import sys
import apt_pkg
apt_pkg.init()
cache = apt_pkg.Cache()
infile = open(sys.argv[1])
outfile_name = sys.argv[1] + '.boiled'
outfile = open(outfile_name, "w")
for line in infile:
    package = line.split()[0]
    if package in cache:
        outfile.write(line)
infile.close()
outfile.close()
sys.exit(0)
The script takes one argument, which is the name of the selections file which has been generated by dpkg. The low level module apt_pkg first has to be initialized with apt_pkg.init(). Then apt_pkg.Cache() can be used to instantiate a cache object (here: cache). That object is iterable, thus it's easy to not do something if a package from that list couldn't be found in the database, like not copying the corresponding line into the outfile (.boiled), while the others are copied. The result then looks something like this:
$ diff selections.txt selections.txt.boiled 
3780d3779
< python-timemachine   install
4438d4436
< wlan-supercracker    install
That script might also be useful for moving from one distribution resp. derivative to another (like from Ubuntu to Debian). For productive use, open() should of course be secured against FileNotFoundError and IOError to prevent program crashes on such events.
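A minimal sketch of what that could look like, reusing the script above and only guarding the two open() calls (the error messages are just examples):
#!/usr/bin/env python3
import sys
import apt_pkg
apt_pkg.init()
cache = apt_pkg.Cache()
try:
    infile = open(sys.argv[1])
except (FileNotFoundError, IOError) as error:
    # exit with a message instead of crashing with a traceback
    sys.exit("cannot open selections file: {}".format(error))
outfile_name = sys.argv[1] + '.boiled'
try:
    outfile = open(outfile_name, "w")
except IOError as error:
    sys.exit("cannot write output file: {}".format(error))
for line in infile:
    package = line.split()[0]
    if package in cache:
        outfile.write(line)
infile.close()
outfile.close()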

purge rc-s Like also widely known, deinstalled packages leave stuff like configuration files, maintainer scripts and logs on the computer, to save that in case the package gets reinstalled at some point in the future. That happens if dpkg has been used with -r/--remove instead of -P/--purge, which also removes these files which are otherwise left behind. These packages are then marked as rc in the package archive, like:
$ dpkg -l | grep ^rc
rc  firebird2.5-common          2.5.6.27020.ds4-3   amd64   common files for firebird 2.5 servers and clients
rc  firebird2.5-server-common   2.5.6.27020.ds4-3   amd64   common files for firebird 2.5 servers
rc  firebird3.0-common          3.0.1.32609.ds4-8   all     common files for firebird 3.0 server, client and utilities
rc  imagemagick-common          8:6.9.6.2+dfsg-2    all     image manipulation programs -- infrastructure dummy package
They can be purged afterwards to completely remove them from the system. There are several shell coding snippets to be found on the net for completing this job automatically, like this one here:
dpkg -l | grep "^rc" | sed -e "s/^rc //" -e "s/ .*$//" | \
xargs dpkg --purge
The first thing which is needed to handle this in a Python script is the information that in apt_pkg, the package state rc is by default represented by the code 5:
>>> testpackage = cache['firebird2.5-common']
>>> testpackage.current_state
5
For changing things in the database, apt_pkg.DepCache() can be docked onto a cache object to manipulate the installation state of a package within it, like marking it to be removed resp. purged:
>>> mydepcache = apt_pkg.DepCache(mycache)
>>> mydepcache.mark_delete(testpackage, True) # True = purge
>>> mydepcache.marked_delete(testpackage)
True
That's basically all that is needed for an old package purging maintenance script in Python 3; add another iterator as package filter and there you go:
#!/usr/bin/env python3
import sys
import apt_pkg
from apt.progress.text import AcquireProgress
from apt.progress.base import InstallProgress
acquire = AcquireProgress()
install = InstallProgress()
apt_pkg.init()
cache = apt_pkg.Cache()
depcache = apt_pkg.DepCache(cache)
for paket in cache.packages:
    if paket.current_state == 5:
        depcache.mark_delete(paket, True)
depcache.commit(acquire, install)
The method DepCache.commit() applies the changes to the package archive at the end, and it needs apt_progress to perform. Of course this script needs superuser rights to run. It then returns something like this:
$ sudo ./rc-purge 
Reading package lists... Done
Building dependency tree
Reading state information... Done
Fetched 0 B in 0s (0 B/s)
custom fork found
got pid: 17984
got pid: 0
got fd: 4
(Reading database ... 701434 files and directories currently installed.)
Purging configuration files for libmimic0:amd64 (1.0.4-2.3) ...
Purging configuration files for libadns1 (1.5.0~rc1-1) ...
Purging configuration files for libreoffice-sdbc-firebird (1:5.2.2~rc2-2) ...
Purging configuration files for vlc-nox (2.2.4-7) ...
Purging configuration files for librlog5v5 (1.4-4) ...
Purging configuration files for firebird3.0-common (3.0.1.32609.ds4-8) ...
Purging configuration files for imagemagick-common (8:6.9.6.2+dfsg-2) ...
Purging configuration files for firebird2.5-server-common (2.5.6.27020.ds4-3)
It's not yet production ready (for example, there's an infinite loop if dpkg returns error code 1, like from "can't remove non empty folder"). But generally, ATTENTION: be very careful with typos and other mistakes if you want to use that code snippet; a faulty script performing changes on the package database might destroy the integrity of your system, and you don't want that to happen.

detect wild packages Like said above, installed Debian packages might be called "wild" if they have been downloaded from somewhere on the net and installed manually, like is done from time to time on many systems. If you want to remove that whole class of packages again for any reason, the question would be how to detect them. A characteristic element is that there is no source connected to such a package, and that can be detected by Python scripting using, again, the bindings for the APT libraries. The package object doesn't have an associated method to query its source, because the origin is always connected to a specific package version; some specific version might have come from security updates, for example. The current version of a package can be queried with DepCache.get_candidate_ver(), which returns a complex apt_pkg.Version object:
>>> import apt_pkg
>>> apt_pkg.init()
>>> mycache = apt_pkg.Cache()
Reading package lists... Done
Building dependency tree
Reading state information... Done
>>> mydepcache = apt_pkg.DepCache(mycache)
>>> testpackage = mydepcache.get_candidate_ver(mycache['nano'])
>>> testpackage
<apt_pkg.Version object: Pkg:'nano' Ver:'2.7.4-1' Section:'editors'  Arch:'amd64' Size:484790 ISize:2092032 Hash:33578 ID:31706 Priority:2>
For version objects there is the method file_list available, which returns a list containing PackageFile() objects:
>>> testpackage.file_list
[(<apt_pkg.PackageFile object: filename:'/var/lib/apt/lists/httpredir.debian.org_debian_dists_testing_main_binary-amd64_Packages'  a=testing,c=main,v=,o=Debian,l=Debian arch='amd64' site='httpredir.debian.org' IndexType='Debian Package Index' Size=38943764 ID:0>, 669901L)]
These file objects contain the index files which are associated with a specific package source (a downloaded package index), which could be read out easily (using a for-loop because there could be multiple file objects):
>>> for files in testpackage.file_list:
...     print(files[0].filename)
/var/lib/apt/lists/httpredir.debian.org_debian_dists_testing_main_binary-amd64_Packages
That explains itself: the nano binary package on this amd64 computer comes from httpredir.debian.org/debian testing main. If a package is "wild", that means it was installed manually, so there is no associated index file to be found, but only /var/lib/dpkg/status (libcudnn5 is not in the official package archives but distributed by Nvidia as a .deb package):
>>> testpackage2 = mydepcache.get_candidate_ver(mycache['libcudnn5'])
>>> for files in testpackage2.file_list:
...     print(files[0].filename)
/var/lib/dpkg/status
The simple trick now is to find all packages which have only /var/lib/dpkg/status as associated system file (that doesn't refer to what packages contain), and not an index file representing a package source. There's a little pitfall: that's also true for virtual packages. But virtual packages commonly don't have an associated version (python-apt docs: "to check whether a package is virtual; that is, it has no versions and is provided at least once"), and that can be queried with Package.has_versions. A filter to find any packages that aren't virtual packages, are solely associated with one system file, and where that file is /var/lib/dpkg/status, then goes like this:
for package in cache.packages:
    if package.has_versions:
        version = mydepcache.get_candidate_ver(package)
        if len(version.file_list) == 1:
            if 'dpkg/status' in version.file_list[0][0].filename:
                print(package.name)
On my Debian testing system this puts out a quite interesting list. It lists all the wild packages like libcudnn5, but also packages which are currently not in testing because they have been temporarily removed by AUTORM due to release critical bugs. Then there's all the obsolete stuff which has been installed from the package archives once and then been forgotten, like old kernel header packages ("obsolete packages" in dselect). So this snippet brings up other stuff, too; thus, this might be more experimental stuff so far.
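For reference, the filter can also be wrapped into a small standalone script; this is a minimal sketch that just combines the calls shown above and prints one package name per line:
#!/usr/bin/env python3
import apt_pkg
apt_pkg.init()
cache = apt_pkg.Cache()
depcache = apt_pkg.DepCache(cache)
for package in cache.packages:
    # virtual packages have no versions at all
    if not package.has_versions:
        continue
    version = depcache.get_candidate_ver(package)
    if version is None:
        continue
    # "wild" packages are only associated with /var/lib/dpkg/status,
    # not with any downloaded package index
    if len(version.file_list) == 1 and 'dpkg/status' in version.file_list[0][0].filename:
        print(package.name)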

6 February 2017

Daniel Stender: Howto create a Debian 9 preview as Vagrant box with Packer

I've got some little scripts and a template here to automatically create Vagrant boxes from cutting edge Debian testing daily snapshots (netinstall ISO image) using HashiCorp's Packer. To create Vagrant boxes with these, you first need a running binary of Packer. There is a Debian package available if that's also your working environment, but Packer is going to be introduced into the stable branch with the upcoming Stretch release itself. However, Ubuntu already has it, and some other derivatives, too. And there are prebuilt binaries available from the developer's site which run fine out-of-the-box (you just have to put the single binary somewhere into your $PATH, or expand that to find it). The JSON template should run with any Packer which is available for any of the different systems. Vagrant itself isn't needed to build the box with Packer, but Virtualbox is of course needed to pre-bake the machine image within a virtual machine. In Debian the base binaries of Virtualbox are in the contrib archive section, so that source might have to be added to /etc/apt/sources.list, if it isn't already. The scripts have been tested to run with 5.1.10, and I haven't seen that any later versions are demanded in particular, but of course heavily outdated versions might not work properly. Packer installs the guest additions ISO file for Virtualbox into the virtual machine (and the shipped provisioning script then builds them inside). For that, the Debian package which ships that ISO (which is in non-free) is recognized if it is installed, and can then be used by Packer. When the ISO isn't available in the places which Packer checks on the host, the builder automatically downloads the corresponding ISO from http://download.virtualbox.org/virtualbox. When the tarball with the scripts is unpacked, just do make create and the process should run through, if Packer and Virtualbox are available. If your environment has neither GNU Make nor wget you might want to copy the relevant lines from the Makefile and run them manually. But if it does, just do it like this:
/tmp/debian-testing-vagrantbox$ make create
virtualbox-iso output will be in this color.
==> virtualbox-iso: Downloading or copying Guest additions
    virtualbox-iso: Downloading or copying: file:///usr/share/virtualbox/VBoxGuestAdditions.iso
==> virtualbox-iso: Downloading or copying ISO
    virtualbox-iso: Downloading or copying: http://cdimage.debian.org/cdimage/daily-builds/daily/arch-latest/amd64/iso-cd/debian-testing-amd64-netinst.iso
    virtualbox-iso: Download progress: 10%
 ... 
    virtualbox-iso: Download progress: 96%
==> virtualbox-iso: Starting HTTP server on port 8219
==> virtualbox-iso: Creating virtual machine...
==> virtualbox-iso: Creating hard drive...
==> virtualbox-iso: Creating forwarded port mapping for communicator (SSH, WinRM, etc) (host port 2885)
==> virtualbox-iso: Starting the virtual machine...
==> virtualbox-iso: Waiting 10s for boot...
==> virtualbox-iso: Typing the boot command...
==> virtualbox-iso: Waiting for SSH to become available...
The Virtualbox window then pops up and the build process continues within the virtual machine for a while. You might want to file a Github issue when there's a problem on your machine, please! (please include the tail of your packer.log) The Packer template (debian-testing-vagrant.json) is described in the documentation of the virtualbox-iso builder. A preseeding script for the Debian Installer (preseed.cfg) is also included, which gets injected into the virtual build environment during the build process. The creation progress of the Debian base installation can be easily monitored since the Virtualbox window is fully shown during the Packer run (if you lose your mouse pointer by clicking inside that window, do <Right>+<Control> to escape). For good performance, a fast internet connection is needed since a whole base system must be downloaded; if that's available, the whole automated process only takes a couple of minutes to complete, even on a non-SSD machine. When Packer has finished and a fresh box has been created (the size is about 690 MB), it can be used with Vagrant. Just add the new box with:
/tmp/debian-testing-vagrantbox$ vagrant box add stretch-preview debian-testing-vagrant.box
==> box: Box file was not detected as metadata. Adding it directly...
==> box: Adding box 'stretch-preview' (v0) for provider: 
    box: Unpacking necessary files from: file:///tmp/debian-testing-vagrantbox/debian-testing-vagrant.box
==> box: Successfully added box 'stretch-preview' (v0) for 'virtualbox'!
It then could be initialized within a random working directory with:
/tmp/myproject$ vagrant init stretch-preview
A  Vagrantfile  has been placed in this directory. You are now
ready to  vagrant up  your first virtual environment! Please read
the comments in the Vagrantfile as well as documentation on
 vagrantup.com  for more information on using Vagrant.
After that, you could launch the virtual box with:
/tmp/myproject$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Importing base box 'stretch-preview'...
==> default: Matching MAC address for NAT networking...
==> default: Setting the name of the VM: myproject_default_1486321215067_75270
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
==> default: Forwarding ports...
    default: 22 (guest) => 2222 (host) (adapter 1)
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2222
    default: SSH username: vagrant
    default: SSH auth method: private key
    default: 
    default: Vagrant insecure key detected. Vagrant will automatically replace
    default: this with a newly generated keypair for better security.
    default: 
    default: Inserting generated public key within guest...
    default: Removing insecure key from the guest if it's present...
    default: Key inserted! Disconnecting and reconnecting using new SSH key...
==> default: Machine booted and ready!
==> default: Checking for guest additions in VM...
==> default: Mounting shared folders...
    default: /vagrant => /tmp/myproject
/tmp/myproject$
Then you can SSH into it by doing (touch is used here only to point to the shared folder):
/tmp/myproject$ touch hello!
/tmp/myproject$ vagrant ssh
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
/usr/bin/xauth:  file /home/vagrant/.Xauthority does not exist
vagrant@packer-virtualbox-iso-1486319595:~$ cat /etc/debian_version
9.0
vagrant@packer-virtualbox-iso-1486319595:~$ ls /vagrant/
hello!  Vagrantfile
vagrant@packer-virtualbox-iso-1486319595:~$ exit
logout
Connection to 127.0.0.1 closed.
/tmp/myproject$ vagrant halt
==> default: Attempting graceful shutdown of VM...
/tmp/myproject$
If you haven't worked with Vagrant before, maybe this is appealing. The user experience is somewhat different from working with a chroot. Packer makes it very convenient to keep freshly created boxes coming. Have fun!

11 January 2017

Reproducible builds folks: Reproducible Builds: week 89 in Stretch cycle

What happened in the Reproducible Builds effort between Sunday January 1 and Saturday January 7 2017: GSoC and Outreachy updates Toolchain development Packages reviewed and fixed, and bugs filed Chris Lamb: Dhole: Reviews of unreproducible packages 13 package reviews have been added, 4 have been updated and 6 have been removed in this week, adding to our knowledge about identified issues. 2 issue types have been added/updated: Upstreaming of reproducibility fixes Merged: Opened: Weekly QA work During our reproducibility testing, the following FTBFS bugs have been detected and reported by: diffoscope development diffoscope 67 was uploaded to unstable by Chris Lamb. It included contributions from:
[ Chris Lamb ]
* Optimisations:
  - Avoid multiple iterations over archive by unpacking once for an ~8X
    runtime optimisation.
  - Avoid unnecessary splitting and interpolating for a ~20X optimisation
    when writing --text output.
  - Avoid expensive diff regex parsing until we need it, speeding up diff
    parsing by 2X.
  - Alias expensive Config() in diff parsing lookup for a 10% optimisation.
* Progress bar:
  - Show filenames, ELF sections, etc. in progress bar.
  - Emit JSON on the status file descriptor output instead of a custom
    format.
* Logging:
  - Use more-Pythonic logging functions and output based on __name__, etc.
  - Use Debian-style "I:", "D:" log level format modifier.
  - Only print milliseconds in output, not microseconds.
  - Print version in debug output so that saved debug outputs can standalone
    as bug reports.
* Profiling:
  - Also report the total number of method calls, not just the total time.
  - Report on the total wall clock taken to execute diffoscope, including
    cleanup.
* Tidying:
  - Rename "NonExisting" -> "Missing".
  - Entirely rework diffoscope.comparators module, splitting as many separate
    concerns into a different utility package, tidying imports, etc.
  - Split diffoscope.difference into diffoscope.diff, etc.
  - Update file references in debian/copyright post module reorganisation.
  - Many other cleanups, etc.
* Misc:
  - Clarify comment regarding why we call python3(1) directly. Thanks to Jérémy
    Bobbio <lunar@debian.org>.
  - Raise a clearer error if trying to use --html-dir on a file.
  - Fix --output-empty when files are identical and no outputs specified.
[ Reiner Herrmann ]
* Extend .apk recognition regex to also match zip archives (Closes: #849638)
[ Mattia Rizzolo ]
* Follow the rename of the Debian package "python-jsbeautifier" to
  "jsbeautifier".
[ siamezzze ]
* Fixed no newline being classified as order-like difference.
reprotest development reprotest 0.5 was uploaded to unstable by Chris Lamb. It included contributions from:
[ Ximin Luo ]
* Stop advertising variations that we're not actually varying.
  That is: domain_host, shell, user_group.
* Fix auto-presets in the case of a file in the current directory.
* Allow disabling build-path variations. (Closes: #833284)
* Add a faketime variation, with NO_FAKE_STAT=1 to avoid messing with
  various buildsystems. This is on by default; if it causes your builds
  to mess up please do file a bug report.
* Add a --store-dir option to save artifacts.
Other contributions (not yet uploaded): reproducible-builds.org website development tests.reproducible-builds.org Misc. This week's edition was written by Chris Lamb, Holger Levsen and Vagrant Cascadian, reviewed by a bunch of Reproducible Builds folks on IRC & the mailing lists.

3 January 2017

Maria Glukhova: Getting to know diffoscope better

I apologize to all potential readers of this blog for not writing a comprehensive introduction post with details of the project I am taking part in during my internship, as well as some story about how I ended up there. Let me just say that I had been a Debian user for years when I discovered it is taking part in Outreachy as one of the participating organisations. Their Reproducible Builds effort has a noble goal and a bunch of great people behind it - I had no chance not to get excited by it. Looking for a place where my skills could be of any use, I discovered diffoscope - the tool for in-depth comparison of files, archives etc. My mentor, Mattia Rizzolo, supported my decision to work on it, so now I am concentrating my efforts on improving diffoscope. As my first steps, I am doing the small (but hopefully still somewhat important) job of fixing existing bugs. It helps me to better understand how diffoscope works, as well as introducing me to the workflow of open-source development. During December, I have made several small contributions, mostly fixing bugs.

Test data and jessie-backports The first of them could be called cleaning up after my own mistake, although that mistake wasn't trivial. During the application period, I had fixed a bug with diffoscope failing while comparing symlinks to directories. That was a small change, but I included some tests for that case anyway. And that actually caused problems. With these tests, I included test data: two folders with symlinks. All was good in the unstable version of Debian, but in jessie-backports that commit caused the build to fail. After some digging, I discovered the problem was caused by the build process copying that data. That was done using the shutil Python module, and the older version of that module included in jessie could not handle copying symlinks to directories properly. Thanks to my mentor for giving me a hint on how to resolve this: using temporary folders and creating these symlinks at runtime (roughly as sketched below). That way, we ensured the tests run without problems during the build process on jessie. What have I learned: A great deal, actually. I spent too much time on that one, but I learned how to build packages, what happens during a dpkg-buildpackage run and what the debhelper tools are for. I also learned a bit about what chroot is and how to use it for testing.
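The idea of the fix, as a minimal sketch with hypothetical names (not the actual diffoscope test code):

import os
import tempfile

def make_symlink_fixture():
    # Create the directory and the symlink pointing to it at runtime instead
    # of shipping them as test data, so the build never has to copy symlinks
    # with an older shutil.
    tmpdir = tempfile.mkdtemp()
    target = os.path.join(tmpdir, 'dir')
    os.mkdir(target)
    link = os.path.join(tmpdir, 'link')
    os.symlink(target, link)
    return target, link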

ICC profile files and file type recognizing regexp Another one was also about failing tests and, therefore, a failing build. The failing tests were all due to ICC files not being recognized by diffoscope. It turned out libmagic got an update which changed the description of ICC profile files. Diffoscope was relying on a regexp applied to the file type description to recognize the file, so I changed the regexp to reflect the changes in libmagic. What have I learned: How diffoscope recognizes file types. Got me thinking: maybe there is a better way? That regexp-based approach is doomed to cause problems with every file type description change. I still have this question lingering in my mind - maybe I will come up with an idea later.

Order-like difference in text files Next, I decided to do something a bit bigger and fulfilled a feature request. That request was for detecting order-like differences in text files (when files have the same lines, but in a different order). I did it by collecting the added and removed lines of the diff output in lists, sorting them and then comparing them (roughly as in the sketch below). Sadly, I forgot about one particular case - when one of the files is missing the newline at the end of the file. I was kindly reminded of that quite soon in comments on the bug tracker (thanks danielsh!) and have already fixed that. I also received feedback on how to implement it better, deeper in diffoscope - not using the results of diff, but rather comparing sums of hashes of the lines directly in the difference module. I am yet to try that. What have I learned: That a call to diff is actually the slowest part of a diffoscope run when done on two big text files. Could this help somehow in speeding it up? I don't know yet. I also learned to comment on bugs in the Debian bug tracker and was surprised by how much feedback I got. Thanks to my mentor for pushing me to do that - I definitely need to overcome my fear of communication to be more effective!
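The basic idea, as a minimal sketch (a hypothetical helper, not the actual diffoscope code):

def is_order_like_difference(diff_lines):
    # Collect the added and removed lines of a unified diff, drop the +/-
    # markers, and check whether both sides contain the same lines in a
    # different order.
    added = sorted(line[1:] for line in diff_lines
                   if line.startswith('+') and not line.startswith('+++'))
    removed = sorted(line[1:] for line in diff_lines
                     if line.startswith('-') and not line.startswith('---'))
    return bool(added) and added == removed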

Random FTBFS There was also a very nasty bug that randomly caused diffoscope to fail to build from source, failing with the non-informative Fatal Python error: deallocated None. It already seemed strange when it was first reported; it got only stranger when suddenly that bug ceased to be reproducible. We hoped that would mean the bug was caused by some external tool, and was fixed there. Turns out it was not that easy. I tested this on two separate computers and on a virtual machine; I used different versions of diffoscope. Well. It seems that bug is still somehow tied to the diffoscope version and not some external tool version - I can still do git checkout 64 and reproduce the bug (still randomly, though). Although I spent quite a lot of time on that one, the only result was the information about the connection between bug appearances and the diffoscope version. I still wasn't able to get to the root of the problem - hopefully, someone else will be able to, given the information I found. What have I learned: git-bisect! Thanks to my friend for pointing me to it, that tool came in handy in that situation. Also, I got some experience in catching nasty bugs like that (pity that I got no experience in squashing them). I had some extra time commitments in December, one of them (Reproducible Builds Summit II) connected to my internship and one (my exam session at university) not. In January, I should be able to allocate more time to that work - I hope it will help me achieve more significant results. Many thanks to Mattia Rizzolo, Chris Lamb, Holger Levsen and all the other folks of the Reproducible Builds project - I cannot stress enough how important your support is to me. Wish you all a great 2017!

18 December 2016

Daniel Stender: How to cheat setuptools-scm (Debian diary)

[2016-12-19: some additions] This is another little issue from Python packaging for Debian which I came across lately while packaging the compressed NumPy-based data container Bcolz. Upstream uses setuptools-scm to determine the software's version at build time from the source code management environment the code is in. This method is convenient for upstream development because the version number doesn't need to be hard-coded, and people often just forget to update it (and other version-carrying files like doc/conf.py) when a new version of a project is released. setuptools-scm just needs to be hooked into setup.py to do its job, and in Bcolz the code goes like this:
setup(
    name="bcolz",
    use_scm_version={
        'version_scheme': 'guess-next-dev',
        'local_scheme': 'dirty-tag',
        'write_to': 'bcolz/version.py'
    },
The file the version number is written to is bcolz/version.py. This file isn't in the upstream code revision nor in the tarball which was released by the upstream developers, it's always generated at build time. In Debian there is an error if you try to build a package from a source tree which contains files which aren't to be found in the corresponding tarball, like cruft from a previous build, or if any files have changed - therefore every new package should also be test built twice in a row in a non-chroot environment. Generally there are two ways to solve this: either you add the cruft to debian/clean, or you add the file resp. a matching file pattern to extend-diff-ignore in debian/source/options. Which method is the better one could be discussed; I'm generally using the clean option if something isn't in the upstream tarball, and the source/options solution if something is already in the upstream tarball but gets changed during a build. This is related to your preferred Git procedures: if you remove a file which is in the upstream tarball these removals have to be checked in separately, and that means every time a new upstream tarball is released, which is not very convenient. Another option which is available is to strip certain files from the upstream tarball by putting them on the Files-Excluded list in deb/copyright. By the way, the same complex of problems applies to egg-info/: that folder is shipped or is not shipped in the upstream tarball, and files in that folder get changed during the build. When the source code is put into a Git environment for Debian packaging, there could be problems with the version number setuptools-scm comes up with. This setuptools extension gets the recent version from the latest Git tag when there is a version number to be found, and that's all right. In Git environments for Debian packaging (like e.g. those of the Debian Science group, the Python groups and others) that is available, as the commonly used upstream tags carry that1. The problem is, sometimes the upstream version which Debian has2 doesn't match the original upstream version number which is wanted for version in bcolz/version.py. For example, the suffix +ds is used if the upstream tarball has been stripped of prebuilt files or embedded convenience shipments (like it's the case with the Bcolz package, where c-blosc/ has been stripped because that's built as another package), or the suffix +dfsg shows that non-DFSG-free software has been removed (which can't be distributed through the main archive section). For example, the version string for Bcolz which is found after the build currently (1.1.0+ds1-1) is 1.1.0+ds1:
# coding: utf-8
# file generated by setuptools_scm
# don't change, don't track in version control
version = '1.1.0+ds1'
But that's not wanted, because this version has never been released, yet it appears everywhere:
$ pip list | grep bcolz
bcolz (1.1.0+ds1)
$ python3 -c 'import bcolz; bcolz.print_versions()'
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
bcolz version:     1.1.0+ds1
There are several different ways to fix this. The crowbar approach (as we say in German) is to patch use_scm_version out of setup.py, but if you don't provide any version in exchange, the version number which Setuptools then uses is 0.0.0. The upstream version could be hard-coded into the patch, but then the maintainer must not forget to update it manually, which is not very convenient. Plus, setup.py could change and the patch then might need to be unfuzzed, thus more work. Bad. A patch can be spared by setting and exporting the SETUPTOOLS_SCM_PRETEND_VERSION environment variable for setuptools-scm in debian/rules, which is sometimes used, judging by the returns for that string on Debian Code Search. But how to avoid hard-coding the version number here? The dpkg-dev package (pulled by build-essential) ships a Makefile snippet /usr/share/dpkg/pkg-info.mk which could be included into debian/rules. It defines several variables which are useful for packaging and which are extracted from debian/changelog, like DEB_SOURCE, which contains the source package name. DEB_VERSION_UPSTREAM, which is also available through it, puts out the upstream version without epoch and Debian revision, but it doesn't get any finer grained out of the box. For a custom fix, a regular expression which removes the +... extensions (if present) from the bare upstream version string would be s/\+[^+]*//:
$ echo "1.1.0+ds1"   sed -e 's/\+[^+]*//'
1.1.0
$ echo "1.1.0"   sed -e 's/\+[^+]*//'
1.1.0
$ echo "1.1.0+dfsg12"   sed -e 's/\+[^+]*//'
1.1.0
With that, a custom variable VERSION_UPSTREAM could be set on top of DEB_VERSION_UPSTREAM (from pkg-info.mk) in debian/rules:
include /usr/share/dpkg/pkg-info.mk
VERSION_UPSTREAM = $(shell echo '$(DEB_VERSION_UPSTREAM)' | sed -e 's/\+[^+]*//')
export SETUPTOOLS_SCM_PRETEND_VERSION = $(VERSION_UPSTREAM)
Bam, that works (see the commit here):
# coding: utf-8
# file generated by setuptools_scm
# don't change, don't track in version control
version = '1.1.0'
In addition, I've seen that dh-python also takes care of SETUPTOOLS_SCM_PRETEND_VERSION since 2.20160609. The environment variable is set by the Debhelper build system if python{3,}-setuptools-scm is among the build-dependencies in debian/control. The Perl code for that is in dh/pybuild.pm. I think the version number string above comes from dh-python's pretended version, and not from any of the Git tags (which are currently debian/1.1.0+ds1-1 and upstream/1.1.0+ds1).

  1. For Git in Debian packaging, e.g. see the DEP-14 proposal (Recommended layout for Git packaging repositories): http://dep.debian.net/deps/dep14/ [return]
  2. Following the scheme for package versions [epoch:]upstream_version[-debian-revision] [return]

30 August 2016

Daniel Stender: My work for Debian in August

Here's again a little list of my humble off-time contributions I'm happy to add to the large amount of work we're completing all together each month. Then there is one more "new in Debian" (meaning: "new in unstable") announcement. First, the uploads (a few of them are from July): New packages: Sponsored uploads: Requested resp. suggested for packaging: New in Debian: Lasagne (deep learning framework) Now that the mathematical expression compiler Theano is available in Debian, deep learning frameworks resp. toolkits which have been built on top of it can become available within Debian, too (like Blocks, mentioned before). Theano is a general computing engine of its own which has been developed with a focus on machine learning resp. neural networks, and which features its own declarative tensor language. The toolkits which have been built upon it vary in how much they abstract the bare features of Theano, whether they are "thick" or "thin", so to say. When the abstraction gets higher you gain more end-user convenience, up to the level that you have the architectural components of neural networks available for combination like in a lego box, while the more complicated things which are going on "under the hood" (like how the networks are actually implemented) are hidden. The downside is that thick abstraction layers usually make it difficult to implement novel features like custom layers or loss functions. So more experienced users and specialists might seek out the lower-abstraction toolkits, where you have to think more in terms of Theano. I've got an initial package of Keras in experimental (1.0.7-1), it runs (only a Python 3 package is available so far) but needs some more work (e.g. building the documentation with mkdocs). Keras is a minimalistic, highly modular DNN library inspired by Torch1. It has a clean, rather easy API for experimenting and fast prototyping. It can also run on top of Google's TensorFlow, and we're going to have it ready for that, too. Lasagne follows a different approach. It's, like Keras and Blocks, a Python library to create and train multi-layered artificial neural networks in/on Theano for applications like image recognition resp. classification, speech recognition, image caption generation or other purposes like style transfers from paintings to pictures2. It abstracts Theano as little as possible, and could be seen rather as an extension or an add-on than an abstraction3. Therefore, knowledge of how things work in Theano is needed to make full use of this piece of software. With the new Debian package (0.1+git20160728.8b66737-1)4, the whole required software stack (the corresponding Theano package, NumPy, SciPy, a BLAS implementation, and the nvidia-cuda-toolkit and NVIDIA kernel driver to carry out computations on the GPU5) can be installed most conveniently just by a single apt-get install python{,3}-lasagne command6. If wanted, with the documentation package lasagne-doc for offline use (no running around at remote airports seeking a WIFI spot), either in the Python 2 or the Python 3 branch, or both flavours altogether7. While others have to spend a whole weekend gathering, compiling and installing the needed libraries, you can grab yourself a fresh cup of coffee. These are the advantages of a fully integrated system (sublime message, as always: desktop users, switch to Linux!). When the installation of packages has completed, the MNIST example of Lasagne could be used for a quick check that the whole library stack works properly8:
$ THEANO_FLAGS=device=gpu,floatX=float32 python /usr/share/doc/python-lasagne/examples/mnist.py mlp 5
Using gpu device 0: GeForce 940M (CNMeM is disabled, cuDNN 5005)
Loading data...
Downloading train-images-idx3-ubyte.gz
Downloading train-labels-idx1-ubyte.gz
Downloading t10k-images-idx3-ubyte.gz
Downloading t10k-labels-idx1-ubyte.gz
Building model and compiling functions...
Starting training...
Epoch 1 of 5 took 2.488s
  training loss:        1.217167
  validation loss:      0.407390
  validation accuracy:      88.79 %
Epoch 2 of 5 took 2.460s
  training loss:        0.568058
  validation loss:      0.306875
  validation accuracy:      91.31 %
The example on how to train a neural network on the MNIST database of handwritten digits is refined (it also provides --help) and explained in detail in the Tutorial section of the documentation in /usr/share/doc/lasagne-doc/html/. Very good starting points are also the IPython notebooks that are available from the tutorials by Eben Olson9 and Geoffrey French on the PyData London 201610. There you have Theano basics, examples for employing convolutional neural networks (CNN) and recurrent neural networks (RNN) for a range of different purposes, how to use pre-trained networks for image recognition, etc.
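To give an impression of how thin the abstraction is, a network definition roughly along the lines of the shipped MLP example might look like this (a minimal sketch, not the exact code from mnist.py; it assumes the MNIST images are already available as flattened 784-dimensional float32 rows):

import theano
import theano.tensor as T
import lasagne

# Symbolic inputs: a batch of flattened 28x28 images and their integer labels.
input_var = T.matrix('inputs')
target_var = T.ivector('targets')

# A small multi-layer perceptron: input -> hidden ReLU layer -> softmax output.
network = lasagne.layers.InputLayer(shape=(None, 784), input_var=input_var)
network = lasagne.layers.DenseLayer(network, num_units=500,
                                    nonlinearity=lasagne.nonlinearities.rectify)
network = lasagne.layers.DenseLayer(network, num_units=10,
                                    nonlinearity=lasagne.nonlinearities.softmax)

# Loss, update rule and compiled training function, all expressed in Theano terms.
prediction = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(prediction, target_var).mean()
params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.nesterov_momentum(loss, params,
                                            learning_rate=0.01, momentum=0.9)
train_fn = theano.function([input_var, target_var], loss, updates=updates)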

  1. For a quick comparison of Keras and Lasagne with other toolkits, see Alex Rubinsteyn's PyData NYC 2015 presentation on using LSTM (long short term memory) networks on varying length sequence data like Grimm's fairy tales (https://www.youtube.com/watch?v=E92jDCmJNek 27:30 sq.)
  2. https://github.com/Lasagne/Recipes/tree/master/examples/styletransfer
  3. Great introduction to Theano and Lasagne by Eben Olson on the PyData NYC 2015: https://www.youtube.com/watch?v=dtGhSE1PFh0
  4. The package is "freelancing" currently being in collab-maint, to set up a deep learning packaging team within Debian is in the stage of discussion.
  5. Only available for amd64 and ppc64el.
  6. You would need "testing" as package source in /etc/apt/sources.list to install it from the archive at the present time (I have that for years, but if Debian Testing could be advised as productive system is going to be discussed elsewhere), but it's coming up for Debian 9. The cuda-toolkit and pycuda are in the non-free section of the archive, thus non-free (mostly used in combination with contrib) must be added to main. Plus, it's a mere suggestion of the Theano packages to keep Theano in main, so --install-suggests is needed to pull it automatically with the same command, or this must be given explicitly.
  7. For dealing with Theano in Debian, see this previous blog posting
  8. Like suggested in the guideline From Zero to Lasagne on Ubuntu 14.04. cuDNN isn't available as official Debian package yet, but could be downloaded as a .deb package after registration at https://developer.nvidia.com/cudnn. It integrates well out of the box.
  9. https://github.com/ebenolson/pydata2015
  10. https://github.com/Britefury/deep-learning-tutorial-pydata2016, video: https://www.youtube.com/watch?v=DlNR1MrK4qE

23 August 2016

Reproducible builds folks: Reproducible Builds: week 69 in Stretch cycle

What happened in the Reproducible Builds effort between Sunday August 14 and Saturday August 20 2016: Fasten your seatbelts Important note: we enabled build path variation for unstable now, so your package(s) might become unreproducible, while previously they were said to be reproducible. Given a specific build path they probably still are reproducible, but read on for the details below in the tests.reproducible-builds.org section! As said many times: this is still research and we are working to make it reality. Media coverage Daniel Stender blogged about Python packaging and explained some caveats regarding reproducible builds. Toolchain developments Thomas Schmitt uploaded xorriso, which now obeys SOURCE_DATE_EPOCH. As stated in its man pages:
ENVIRONMENT
[...]
SOURCE_DATE_EPOCH  belongs to the specs of reproducible-builds.org.  It
is supposed to be either undefined or to contain a decimal number which
tells the seconds since january 1st 1970. If it contains a number, then
it is used as time value to set the  default  of  --modification-date=,
--gpt_disk_guid,  and  --set_all_file_dates.  Startup files and program
options can override the effect of SOURCE_DATE_EPOCH.
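In general, honouring the variable in a build tool boils down to something like the following (a generic Python sketch, not xorriso's actual code):

import os
import time

# If SOURCE_DATE_EPOCH is set, use it as the timestamp for generated
# artifacts so that a later rebuild produces bit-identical output;
# otherwise fall back to the current time.
epoch = os.environ.get('SOURCE_DATE_EPOCH')
build_time = int(epoch) if epoch is not None else int(time.time())
print(time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(build_time)))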
Packages reviewed and fixed, and bugs filed The following packages have become reproducible after being fixed: The following updated packages appear to be reproducible now, for reasons we were not able to figure out. (Relevant changelogs did not mention reproducible builds.) The following 2 packages were not changed, but have become reproducible due to changes in their build-dependencies: tagsoup tclx8.4. Some uploads have addressed some reproducibility issues, but not all of them: Patches submitted that have not made their way to the archive yet: Bug tracker house keeping: Reviews of unreproducible packages 55 package reviews have been added, 161 have been updated and 136 have been removed in this week, adding to our knowledge about identified issues. 2 issue types have been updated: Weekly QA work FTBFS bugs have been reported by: diffoscope development Chris Lamb, Holger Levsen and Mattia Rizzolo worked on diffoscope this week. Improvements were made to SquashFS and JSON comparison, the https://try.diffoscope.org/ web service, documentation, packaging, and general code quality. diffoscope 57, 58, and 59 were uploaded to unstable by Chris Lamb. Versions 57 and 58 were both broken, so Holger set up a job on jenkins.debian.net to test diffoscope on each git commit. He also wrote a CONTRIBUTING document to help prevent this from happening in future. From these efforts, we were also able to learn that diffoscope is now reproducible even when built across multiple architectures:
< h01ger>   https://tests.reproducible-builds.org/debian/rb-pkg/unstable/amd64/diffoscope.html shows these packages were built on amd64:
< h01ger>    bd21db708fe91c01ba1c9cb35b9d41a7c9b0db2b 62288 diffoscope_59_all.deb
< h01ger>    366200bf2841136a4c8f8c30bdc87057d59a4cdd 20146 trydiffoscope_59_all.deb
< h01ger>   and on i386:
< h01ger>    bd21db708fe91c01ba1c9cb35b9d41a7c9b0db2b 62288 diffoscope_59_all.deb
< h01ger>    366200bf2841136a4c8f8c30bdc87057d59a4cdd 20146 trydiffoscope_59_all.deb
< h01ger>   and on armhf:
< h01ger>    bd21db708fe91c01ba1c9cb35b9d41a7c9b0db2b 62288 diffoscope_59_all.deb
< h01ger>    366200bf2841136a4c8f8c30bdc87057d59a4cdd 20146 trydiffoscope_59_all.deb
And those also match the binaries uploaded by Chris in his diffoscope 59 binary upload to ftp.debian.org, yay! Eating our own dogfood and enjoying it! tests.reproducible-builds.org Debian related: The last change will probably have an impact you will see: your package might become unreproducible in unstable and this will be shown on tracker.debian.org, while it will still be reproducible in testing. We've done this because we think reproducible builds are possible with arbitrary build paths. But: we don't think those are a realistic goal for stretch, where we still recommend to use .buildinfo to record the build path and then do rebuilds using that path. We are doing this because, besides doing theoretical groundwork, we also have a practical goal: enable users to independently verify builds. And if they can only do this with a fixed path, so be it. For now :) To be clear: for Stretch we recommend that reproducible builds are done in the same build path as the "original" build. Finally, and just for our future reference, when we enabled build path variation on Saturday, August 20th 2016, the numbers for unstable were:
suite all reproducible unreproducible ftbfs depwait not for this arch blacklisted
unstable/amd64 24693 21794 (88.2%) 1753 (7.1%) 972 (3.9%) 65 (0.2%) 95 (0.3%) 10 (0.0%)
unstable/i386 24693 21182 (85.7%) 2349 (9.5%) 972 (3.9%) 76 (0.3%) 103 (0.4%) 10 (0.0%)
unstable/armhf 24693 20889 (84.6%) 2050 (8.3%) 1126 (4.5%) 199 (0.8%) 296 (1.1%) 129 (0.5%)
Misc. Ximin Luo updated our git setup scripts to make it easier for people to write proper descriptions for our repositories. This week's edition was written by Ximin Luo and Holger Levsen and reviewed by a bunch of Reproducible Builds folks on IRC.

20 August 2016

Daniel Stender: Collected notes from Python packaging

Here are some collected notes on some particular problems from packaging Python stuff for Debian, and more like this are coming up in the future. Some of the issues discussed here might be rather simple and even benign for the experienced packager, but maybe this is helpful for people coming across the same issues for the first time, wondering what's going wrong. But some of the things discussed aren't easy. Here are the notes for this posting, in no particular order. UnicodeDecodeError on open() in Python 3 running in non-UTF-8 environments I came across this problem again recently while packaging httpbin 0.5.0. The build breaks the following way, e.g. when trying to build with sbuild in a chroot; this is the first run of setup.py with the default Python 3 interpreter:
I: pybuild base:184: python3.5 setup.py clean 
Traceback (most recent call last):
  File "setup.py", line 5, in <module>
    os.path.join(os.path.dirname(__file__), 'README.rst')).read()
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2386: ordinal not in range(128)
E: pybuild pybuild:274: clean: plugin distutils failed with: exit code=1: python3.5 setup.py clean 
One comes across UnicodeDecodeError-s quite often on different occasions, mostly related to Python 2 packaging (like here). Here the case is that in setup.py the long_description is read from the opened UTF-8 encoded file README.rst:
long_description = open(
    os.path.join(os.path.dirname(__file__), 'README.rst')).read()
This is a problem for Python 3.5 (and other Python 3 versions) when setup.py is executed by an interpreter run in a non-UTF-8 environment1:
$ LANG=C.UTF-8 python3.5 setup.py clean
running clean
$ LANG=C python3.5 setup.py clean
Traceback (most recent call last):
  File "setup.py", line 5, in <module>
    os.path.join(os.path.dirname(__file__), 'README.rst')).read()
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2386: ordinal not in range(128)
$ LANG=C python2.7 setup.py clean
running clean
The reason for this error is that the default encoding of the file object returned by open(), e.g. in Python 3.5, is platform dependent, so read() fails when there's a mismatch:
>>> readme = open('README.rst')
>>> readme
<_io.TextIOWrapper name='README.rst' mode='r' encoding='ANSI_X3.4-1968'>
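The platform default can also be inspected directly (a small illustrative sketch, not from the httpbin sources):

import locale

# open() without an explicit encoding falls back to the platform default,
# which is derived from the locale settings of the environment.
print(locale.getpreferredencoding(False))  # e.g. 'UTF-8', or 'ANSI_X3.4-1968' under LANG=C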
Non-UTF-8 build environments, in which $LANG isn't set at all or is set to C, are common or even the default in Debian packaging; for example, the primary environment of the continuous integration resp. test builds for reproducible builds features that (see here). That's also true for the base system of the sbuild environment:
$ schroot -d / -c unstable-amd64-sbuild -u root
(unstable-amd64-sbuild)root@varuna:/# locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
A problem like this is mostly solved elegantly by installing some workaround in debian/rules. A quick and easy fix is to add export LC_ALL=C.UTF-8 there, which supersedes the locale settings of the build environment. $LC_ALL is commonly used to change the existing locale settings; it overrides all other locale variables with the same value (see here). C.UTF-8 is a UTF-8 locale which is available by default in a base system, so it can be used without installing the locales package (or worse, the huge locales-all package):
(unstable-amd64-sbuild)root@varuna:/# locale -a
C
C.UTF-8
POSIX
This problem should ideally be taken care of upstream. In Python 3, the built-in open() is io.open(), for which a specific encoding can be given, so that the UnicodeDecodeError disappears. In Python 2, the built-in open() doesn't support an encoding argument, but io.open() is available there, too. A fix which works for both Python branches goes like this:
import io
long_description = io.open(
    os.path.join(os.path.dirname(__file__), 'README.rst'), encoding='utf-8').read()
non-deterministic order of requirements in egg-info/requires.txt This problem appeared in prospector/0.11.7-5 in the reproducible builds test builds; that was the first package of Prospector running on Python 32. It was revealed that there were differences in the sorting order of the [with_everything] dependencies resp. requirements in prospector-0.11.7.egg-info/requires.txt if the package was built on varying systems:
$ debbindiff b1/prospector_0.11.7-5_amd64.changes b2/prospector_0.11.7-5_amd64.changes 
 ... 
  prospector_0.11.7-5_all.deb
      file list
      @@ -1,3 +1,3 @@
       -rw-r--r--   0        0        0        4 2016-04-01 20:01:56.000000 debian-binary
      --rw-r--r--   0        0        0     4343 2016-04-01 20:01:56.000000 control.tar.gz
      +-rw-r--r--   0        0        0     4344 2016-04-01 20:01:56.000000 control.tar.gz
       -rw-r--r--   0        0        0    74512 2016-04-01 20:01:56.000000 data.tar.xz
      control.tar.gz
          control.tar
              ./md5sums
                  md5sums
                  Files in package differ
      data.tar.xz
          data.tar
              ./usr/share/prospector/prospector-0.11.7.egg-info/requires.txt
              @@ -1,12 +1,12 @@
               
               [with_everything]
              +pyroma>=1.6,<2.0
               frosted>=1.4.1
               vulture>=0.6
              -pyroma>=1.6,<2.0
The Reproducible Builds folks recognized this to be a toolchain problem and set up the issue randomness_in_python_setuptools_requires.txt to cover it. Plus, a wishlist bug against python-setuptools was filed to fix this. The patch which was provided by Chris Lamb adds sorting of the dependencies in requires.txt for Setuptools by adding sorted() (stdlib) to _write_requirements() in command/egg_info.py:
--- a/setuptools/command/egg_info.py
+++ b/setuptools/command/egg_info.py
@@ -406,7 +406,7 @@ def warn_depends_obsolete(cmd, basename, filename):
 def _write_requirements(stream, reqs):
     lines = yield_lines(reqs or ())
     append_cr = lambda line: line + '\n'
-    lines = map(append_cr, lines)
+    lines = map(append_cr, sorted(lines))
     stream.writelines(lines)
O.k. In the toolchain, nothing sorts these requirements alphanumerically to make differences disappear, but what is the reason for these differences in the Prospector packages, though? The problem is somewhat subtle. In setup.py, [with_everything] is a dictionary entry of _OPTIONAL (which is used for extras_require) that is created by a list comprehension out of the other values in that dictionary. The code goes like this:
_OPTIONAL = {
    'with_frosted': ('frosted>=1.4.1',),
    'with_vulture': ('vulture>=0.6',),
    'with_pyroma': ('pyroma>=1.6,<2.0',),
    'with_pep257': (),  # note: this is no longer optional, so this option will be removed in a future release
}
_OPTIONAL['with_everything'] = [req for req_list in _OPTIONAL.values() for req in req_list]
The result, the new _OPTIONAL dictionary including the key with_everything (which w/o further sorting is the source of the list of requirements in requires.txt), e.g. looks like this (code snippet run through in my IPython):
In [3]: _OPTIONAL
Out[3]: 
{'with_everything': ['vulture>=0.6', 'pyroma>=1.6,<2.0', 'frosted>=1.4.1'],
 'with_vulture': ('vulture>=0.6',),
 'with_pyroma': ('pyroma>=1.6,<2.0',),
 'with_frosted': ('frosted>=1.4.1',),
 'with_pep257': ()}
That list comprehension iterates over the other dictionary entries to gather the value of with_everything, and Python programmers know that dictionaries are of course unordered, so the iteration order of the key-value pairs isn't fixed but is determined by implementation details. That's the source of the non-determinism in this Debian package revision of Prospector. This problem has been fixed by a patch, which simply presorts the list of requirements before it gets added to _OPTIONAL:
@@ -76,8 +76,8 @@
     'with_pyroma': ('pyroma>=1.6,<2.0',),
     'with_pep257': (),  # note: this is no longer optional, so this option will be removed in a future release
  
-_OPTIONAL['with_everything'] = [req for req_list in _OPTIONAL.values() for req in req_list]
-
+with_everything = [req for req_list in _OPTIONAL.values() for req in req_list]
+_OPTIONAL['with_everything'] = sorted(with_everything)
In comparison to the list method sort(), the function sorted() does not change iterables in-place, but returns a new object. Both could be used. As a side note, egg-info/requires.txt isn't even needed, but that's another issue.
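For illustration, the difference in a nutshell (a trivial sketch):

reqs = ['vulture>=0.6', 'pyroma>=1.6,<2.0', 'frosted>=1.4.1']
print(sorted(reqs))  # returns a new, sorted list; reqs itself is unchanged
reqs.sort()          # sorts the list in place and returns None
print(reqs)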

  1. As an alternative to prefixing LC_ALL, env -i could be used to get an empty environment.
  2. 0.11.7-4 already but this package revision was in experimental due to the switch to Python 3 and therefore not tested by reproducible builds continuous integration.

20 July 2016

Daniel Stender: Theano in Debian: maintenance, BLAS and CUDA

I'm glad to announce that we have the current release of Theano (0.8.2) in Debian unstable now; it's on its way into the testing branch and the Debian derivatives, heading for Debian 9. The Debian package is maintained on behalf of the Debian Science Team. We have a binary package with the modules in the Python 2.7 import path (python-theano), if you want or need to stick to that branch a little longer (as a matter of fact, in the current popcon stats it's the most installed package), and a package running on the default Python 3 version (python3-theano). The comprehensive documentation is available for offline usage in another binary package (theano-doc). Although Theano builds its extensions at run time and therefore all binary packages contain the same code, the source package generates arch-specific packages1 so that the exhaustive test suite can run over all the architectures to detect if there are problems somewhere (#824116). what's this? In a nutshell, Theano is a computer algebra system (CAS) and expression compiler, which is implemented in Python as a library. It is named after a Classical Greek female mathematician and it's developed at the LISA lab (located at MILA, the Montreal Institute for Learning Algorithms) at the Université de Montréal. Theano tightly integrates multi-dimensional arrays (N-dimensional, ND-array) from NumPy (numpy.ndarray), which are broadly used in Scientific Python for the representation of numeric data. It features a declarative Python-based language with symbolic operations for the functional definition of mathematical expressions, which allows one to create functions that compute values for them. Internally the expressions are represented as directed graphs with nodes for variables and operations. The internal compiler then optimizes those graphs for stability and speed and then generates high-performance native machine code to evaluate resp. compute these mathematical expressions2. One of the main features of Theano is that it's capable of computing also on GPUs (graphics processing units), like on custom graphics cards (e.g. the developers are using a GeForce GTX Titan X for benchmarks). Today's GPUs have become very powerful parallel floating point devices which can be employed also for scientific computations instead of 3D video games3. The acronym "GPGPU" (general purpose graphical processor unit) refers to special cards like NVIDIA's Tesla4, which could be used alike (more on that below). Thus, Theano is a high-performance number cruncher with its own computing engine which could be used for large-scale scientific computations. If you haven't come across Theano as a Pythonistic professional mathematician, it's also one of the most prevalent frameworks for implementing deep learning applications (training multi-layered, "deep" artificial neural networks, DNN) around5, and has been developed with a focus on machine learning from the ground up. There are several higher-level user interfaces built on top of Theano (for DNN: Keras, Lasagne, Blocks, and others; or for Python probabilistic programming: PyMC3). I'll seek to get some of them available in Debian, too. helper scripts Both binary packages ship three convenience scripts, theano-cache, theano-test, and theano-nose.
Instead of them being copied into /usr/bin, which would result in a binaries-have-conflict violation, the scripts are to be found in /usr/share/python-theano (python3-theano respectively), so that both module packages of Theano can be installed at the same time. The scripts can be run directly from these folders, e.g. do $ python /usr/share/python-theano/theano-nose to achieve that. If you're going to use them heavily, you could add the directory of the flavour you prefer (Python 2 or Python 3) to the $PATH environment variable manually, by either typing e.g. $ export PATH=/usr/share/python-theano:$PATH on the prompt, or by saving that line into ~/.bashrc. Manpages aren't available for these little helper scripts6, but you can always get info on what they do and which arguments they accept by invoking them with the -h flag (for theano-nose) resp. the help flag (for theano-cache). running the tests On some occasions you might want to run the testsuite of the installed library, e.g. to check whether everything runs fine on your GPU hardware. There are two different ways to run the tests (either way you need to have python{,3}-nose installed). One is, you could launch the test suite by doing $ python -c 'import theano; theano.test()' (or the same with python3 to test the other flavour), which is the same as what the helper script theano-test does. However, by doing it that way some particular tests might fail by raising errors also for the group of known failures. Known failures are excluded from being errors if you run the tests with theano-nose, which is a wrapper around nosetests, so this might always be the better choice. You can run this convenience script with the option --theano on the installed library, or from the source package root, which you could pull by $ sudo apt-get source theano (there you also have the option to use bin/theano-nose). The script accepts options for nosetests, so you might run it with -v to increase verbosity. For the tests the configuration switch config.device must be set to cpu. This will also include the GPU tests when a properly accessible device is detected, so that's a little misleading in the sense that it doesn't mean "run everything on the CPU". You're on the safe side if you always run it like this: $ THEANO_FLAGS=device=cpu theano-nose, in case you've set config.device to gpu in your ~/.theanorc. Depending on the available hardware and the used BLAS implementation (see below) it could take quite a long time to run the whole test suite through; on the Core i5 in my laptop that takes around an hour, even excluding the GPU related tests (which perform pretty fast, though). Theano features a couple of switches to manipulate the default configuration for optimization and compilation. There is a rivalry between optimization and compilation costs against the performance of the test suite, and it turned out the test suite performs quicker with less graph optimization. There are two different switches available to control config.optimizer: fast_run toggles maximal optimization, while fast_compile runs only a minimal set of graph optimization features. These settings are used by the general mode switches for config.mode, which is either FAST_RUN by default, or FAST_COMPILE. The default mode FAST_RUN (optimizer=fast_run, linker=cvm) needs around 72 minutes on my lower mid-level machine (on un-optimized BLAS). Setting mode=FAST_COMPILE (optimizer=fast_compile, linker=py) brings some boost for the performance of the test suite because it runs the whole suite in 46 minutes.
The downside of that is that C code compilation is disabled in this mode by using the linker py, and also the GPU related tests are not included. I've played around with using the optimizer fast_compile with some of the other linkers (c|py and cvm, and their versions without garbage collection) as an alternative to FAST_COMPILE, with minimal optimization but also machine code compilation incl. GPU testing. But in my experience, fast_compile with any linker other than py results in some new errors and failures of some tests on amd64, and this might be the case on other architectures, too. By the way, another useful feature is DebugMode for config.mode, which verifies the correctness of all optimizations and compares the C to the Python results. If you want to have detailed info on the configuration settings of Theano, do $ python -c 'import theano; print theano.config' | less, and check out the chapter config in the library documentation. cache maintenance Theano isn't a JIT (just-in-time) compiler like Numba, which generates native machine code in memory and executes it immediately, but it saves the generated native machine code into compiledirs. The reason for doing it that way is quite practical, like the docs explain: the persistent cache on disk makes it possible to avoid generating code for the same operation, and to avoid compiling again when different operations generate the same code. The compiledirs by default are located within $(HOME)/.theano/. After some time the folder becomes quite large, and might look something like this:
$ ls ~/.theano
compiledir_Linux-4.5--amd64-x86_64-with-debian-stretch-sid--2.7.11+-64
compiledir_Linux-4.5--amd64-x86_64-with-debian-stretch-sid--2.7.12-64
compiledir_Linux-4.5--amd64-x86_64-with-debian-stretch-sid--2.7.12rc1-64
compiledir_Linux-4.5--amd64-x86_64-with-debian-stretch-sid--3.5.1+-64
compiledir_Linux-4.5--amd64-x86_64-with-debian-stretch-sid--3.5.2-64
compiledir_Linux-4.5--amd64-x86_64-with-debian-stretch-sid--3.5.2rc1-64
If the used Python version changed like in this example, you might want to purge the obsolete cache. For working with the cache resp. the compiledirs, the helper theano-cache comes in handy. If you invoke it without any arguments the current cache location is put out, like ~/.theano/compiledir_Linux-4.5--amd64-x86_64-with-debian-stretch-sid--2.7.12-64 (the script is run from /usr/share/python-theano). So, the compiledirs for the old Python versions in this example (11+ and 12rc1) can be removed to free the space they occupy. All compiledirs resp. cache directories, meaning the whole cache, can be erased by $ theano-cache basecompiledir purge; the effect is the same as performing $ rm -rf ~/.theano. You might want to do that e.g. if you're using different hardware, like when you got yourself another graphics card. Or habitually from time to time, when the compiledirs fill up so much that it slows down processing with the harddisk being very busy all the time, if you don't have an SSD drive available. For example, the disk space of build chroots carrying (mainly) the tests completely compiled through on default Python 2 and Python 3 consumes around 1.3 GB (see here). BLAS implementations Theano needs a level 3 implementation of BLAS (Basic Linear Algebra Subprograms) for operations between vectors (one-dimensional mathematical objects) and matrices (two-dimensional objects) carried out on the CPU. NumPy is already built on BLAS and pulls the standard implementation (libblas3, source package: lapack), but Theano links directly to it instead of using NumPy as an intermediate layer, to reduce the computational overhead. For this, Theano needs development headers, and the binary packages pull libblas-dev by default, if any other development package of another BLAS implementation (like OpenBLAS or ATLAS) isn't already installed or pulled with them (providing the virtual package libblas.so). The linker flags could be manipulated directly through the configuration switch config.blas.ldflags, which is by default set to -L/usr/lib -lblas -lblas. By the way, if you set it to an empty value, Theano falls back to using BLAS through NumPy, if you want to have that for some reason. On Debian, there is a very convenient way to switch between BLAS implementations with the alternatives mechanism. If you have several alternative implementations installed at the same time, you can switch from one to another easily by just doing:
$ sudo update-alternatives --config libblas.so
There are 3 choices for the alternative libblas.so (providing /usr/lib/libblas.so).
  Selection    Path                                  Priority   Status
------------------------------------------------------------
* 0            /usr/lib/openblas-base/libblas.so      40        auto mode
  1            /usr/lib/atlas-base/atlas/libblas.so   35        manual mode
  2            /usr/lib/libblas/libblas.so            10        manual mode
  3            /usr/lib/openblas-base/libblas.so      40        manual mode
Press <enter> to keep the current choice[*], or type selection number:
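A quick complementary check from the Python side, showing which BLAS NumPy itself was linked against and which linker flags Theano currently uses (a small sketch; it assumes the NumPy and Theano packages are installed):

import numpy
import theano

# Show the BLAS/LAPACK libraries NumPy was built against.
numpy.__config__.show()
# The linker flags Theano uses to link against BLAS directly;
# an empty value means it falls back to BLAS through NumPy.
print(theano.config.blas.ldflags)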
The implementations perform differently on different hardware, so you might want to take the time to compare which one does it best on your processor (the other packages are libatlas-base-dev and libopenblas-dev), and choose that to optimize your system. If you want to squeeze out all which is in there for carrying out Theano's computations on the CPU, another option is to compile an optimized version of a BLAS library especially for your processor. I'm going to write another blog posting on this issue. The binary packages of Theano ship the script check_blas.py to check how well a BLAS implementation performs with it, and if everything works right. That script is located in the misc subfolder of the library; you could locate it by doing $ dpkg -L python-theano | grep check_blas (or for the package python3-theano accordingly), and run it with the Python interpreter. By default the script puts out a lot of info, like a huge performance comparison reference table, the current setting of blas.ldflags, the compiledir, the setting of floatX, OS information, the GCC version, the current NumPy config towards BLAS, NumPy location and version, whether Theano linked directly or has used the NumPy binding, and finally and most importantly, the execution time. If just the execution time for quick performance comparisons is needed, this script could be invoked with -q. Theano on CUDA The function compiler of Theano works with alternative backends to carry out the computations, like the ones for graphics cards. Currently, there are two different backends for GPU processing available: one docks onto NVIDIA's CUDA (Compute Unified Device Architecture) technology7, and another one onto libgpuarray, which is also developed by the Theano developers in parallel. The libgpuarray library is an interesting alternative for Theano, it's a GPU tensor (multi-dimensional mathematical object) array written in C with Python bindings based on Cython, which has the advantage of running also on OpenCL8. OpenCL, unlike CUDA9, is fully free software, vendor neutral and overcomes the limitation of the CUDA toolkit being only available for amd64 and the ppc64el port (see here). I've opened an ITP on libgpuarray and we'll see if and how this works out. Another reason why it would be great to have it available is that it looks like CUDA currently runs into problems with GCC 610. More on that, soon. Here's a little checklist for setting up your CUDA device so that you don't have to experience something like this:
$ THEANO_FLAGS=device=gpu,floatX=float32 python ./cat_dog_classifier.py 
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available (error: Unable to get the number of gpus available: no CUDA-capable device is detected)
hardware check For running Theano on CUDA you need an NVIDIA graphics card which is capable of doing that. You can recheck if your device is supported by CUDA here. When the hardware isn't too old (CUDA support started with GeForce 8 and Quadro X series) or too strange I think it isn't working only in exceptional cases. You can check your model and if the device is present in the system on the bare hardware level by doing this:
$ lspci | grep -i nvidia
04:00.0 3D controller: NVIDIA Corporation GM108M [GeForce 940M] (rev a2)
If a line like this doesn't get returned, your device most probably is broken, or not properly connected (ouch). If rev ff appears at the end of the line that means the device is off meaning powered down. This might be happening if you have a laptop with Optimus graphics hardware, and the related drivers have switched off the unoccupied device to safe energy11. kernel module Running CUDA applications requires the proprietary NVIDIA driver kernel module to be loaded into the kernel and working. If you haven't already installed it for another purpose, the NVIDIA driver and the CUDA toolkit are both in the non-free section of the Debian archive, which is not enabled by default. To get non-free packages you have to add non-free (and it's better to do so, also contrib) to your package source in /etc/apt/sources.list, which might then look like this:
deb http://httpredir.debian.org/debian/ testing main contrib non-free
After doing that, perform $ apt-get update to update the package lists, and there you go with the non-free packages. The headers of the running kernel are needed to compile modules; you can get them together with the NVIDIA kernel module package by running:
$ sudo apt-get install linux-headers-$(uname -r) nvidia-kernel-dkms build-essential
DKMS will then build the NVIDIA module for the kernel and do some other things on the system. When the installation has finished, it's generally advised to reboot the system completely. troubleshooting If you have problems with the CUDA device, it's advised to verify that the following things concerning the NVIDIA driver resp. kernel module are in order: blacklist nouveau Check if the default Nouveau kernel module driver (which blocks the NVIDIA module) for some reason still gets loaded, by doing $ lsmod | grep nouveau. If nothing gets returned, that's right. If it's still in the kernel, just add blacklist nouveau to /etc/modprobe.d/blacklist.conf, and update the boot ramdisk with sudo update-initramfs -u afterwards. Then reboot once more; this shouldn't be the case anymore. rebuild kernel module To fix it when the module hasn't been properly compiled for some reason, you could trigger a rebuild of the NVIDIA kernel module with $ sudo dpkg-reconfigure nvidia-kernel-dkms. When you're about to send your hardware in for repair because everything looks all right but the device just isn't working, that really could help (own experience). After the rebuild of the module or modules (if you have a few kernel packages installed) has completed, you can recheck if the module really is available by running:
$ sudo modinfo nvidia-current
filename:       /lib/modules/4.4.0-1-amd64/updates/dkms/nvidia-current.ko
alias:          char-major-195-*
version:        352.79
supported:      external
license:        NVIDIA
alias:          pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        drm
vermagic:       4.4.0-1-amd64 SMP mod_unload modversions 
parm:           NVreg_Mobile:int
It should be something similiar to this when everything is all right. reload kernel module When there are problems with the GPU, maybe the kernel module isn't properly loaded. You could recheck if the module has been properly loaded by doing
$ lsmod | grep nvidia
nvidia_uvm             73728  0
nvidia               8540160  1 nvidia_uvm
drm                   356352  7 i915,drm_kms_helper,nvidia
The kernel module could be loaded resp. reloaded with $ sudo nvidia-modprobe (that tool is from the package nvidia-modprobe). unsupported graphics card Be sure that your graphics card is supported by the current driver kernel module. If you have bought new hardware, that can quite possibly turn out to be a problem. You can get the version of the current NVIDIA driver with:
$ cat /proc/driver/nvidia/version 
NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.79  Wed Jan 13 16:17:53 PST 2016
GCC version:  gcc version 5.3.1 20160528 (Debian 5.3.1-21)
Then, google the version number like nvidia 352.79; this should get you onto an official driver download page like this. There, check what's to be found under "Supported Products". If you're stuck with that, there are two options: wait until the driver in Debian gets updated, or replace it with the latest driver package from NVIDIA. That's possible to do, but something more for experienced users. occupied graphics card The CUDA driver cannot work while the graphical interface is busy, e.g. processing the graphical display of your X.Org server. Which kernel driver actually is used to process the desktop can be examined with this command:12
$ grep '(II).*([0-9]):' /var/log/Xorg.0.log
[    37.700] (II) intel(0): Using Kernel Mode Setting driver: i915, version 1.6.0 20150522
[    37.700] (II) intel(0): SNA compiled: xserver-xorg-video-intel 2:2.99.917-2 (Vincent Cheng <vcheng@debian.org>)
 ... 
[    39.808] (II) intel(0): switch to mode 1920x1080@60.0 on eDP1 using pipe 0, position (0, 0), rotation normal, reflection none
[    39.810] (II) intel(0): Setting screen physical size to 508 x 285
[    67.576] (II) intel(0): EDID vendor "CMN", prod id 5941
[    67.576] (II) intel(0): Printing DDC gathered Modelines:
[    67.576] (II) intel(0): Modeline "1920x1080"x0.0  152.84  1920 1968 2000 2250  1080 1083 1088 1132 -hsync -vsync (67.9 kHz eP)
This example shows that the rendering of the desktop is performed by the graphics device of the Intel CPU, which is just what's needed for running CUDA applications on your NVIDIA graphics card, if you don't have another one.

nvidia-cuda-toolkit

With the Debian package of the CUDA toolkit everything pretty much runs out of the box for Theano. Just install it with apt-get and you're ready to go; the CUDA backend is the default one. Pycuda is also a suggested dependency of the binary packages, so it can be pulled in together with the CUDA toolkit. The up-to-date CUDA release 7.5 is of course available; with that you have Maxwell architecture support, so you can run Theano on e.g. a GeForce GTX Titan X with 6.2 TFLOPS on single precision13 at an affordable price. CUDA 814 is around the corner with support for the new Pascal architecture15; the GeForce GTX 1080 high-end gaming graphics card, for example, already delivers 8.23 TFLOPS16. When it comes to professional GPGPU hardware like the Tesla P100 there is much more computational power available, scalable by multiplying cores resp. cards up to genuine little supercomputers which fit on a desk, like the DGX-117. Theano can use multiple GPUs for its calculations to work with highly scaled hardware; I'll write another blog post on this issue.

Theano on the GPU

It's not difficult to run Theano on the GPU. Only single precision floating point numbers (float32) are supported on the GPU, but that is sufficient for deep learning applications. Theano uses double precision floats (float64) by default, so you have to set the configuration variable config.floatX to float32, as written above, either with the THEANO_FLAGS environment variable or, if you're going to use the GPU a lot, better in your .theanorc file. Switching to the GPU actually happens with the config.device configuration variable, which must be set to either gpu or gpu0, gpu1 etc. to choose a particular one if multiple devices are available. Here's a little test script check1.py; it's taken from the docs and slightly altered. You can run that script either with python or python3 (there was a single test failure on the Python 3 package, so the Python 2 library might currently be a little more stable). For comparison, here's how it performs on my hardware, once on the CPU and once on the GPU; since the script itself isn't reproduced in this post, a rough sketch of it comes first, followed by the two runs:
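The following is only a minimal sketch along the lines of the GPU test script in the Theano documentation, which check1.py is based on, so the details may differ from the script actually used here:

from theano import function, config, shared
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # roughly 10 x #cores x #threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
# if any op in the compiled graph is a plain (CPU) Elemwise, the GPU wasn't used
if numpy.any([isinstance(node.op, T.Elemwise) for node in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')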
$ THEANO_FLAGS=floatX=float32 python ./check1.py 
[Elemwise exp,no_inplace (<TensorType(float32, vector)>)]
Looping 1000 times took 4.481719 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
  1.62323284]
Used the cpu
$ THEANO_FLAGS=floatX=float32,device=gpu python ./check1.py 
Using gpu device 0: GeForce 940M (CNMeM is disabled, cuDNN not available)
[GpuElemwise exp,no_inplace (<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise exp,no_inplace .0)]
Looping 1000 times took 1.164906 seconds
Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu
If you get a result like this you're ready to go with Theano on Debian, training computer vision classifiers or whatever else you want to do with it. I'll write more on what Theano can be used for, soon.
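As a small addition which isn't part of the original post: whether the flags have actually been picked up can also be double-checked from within Python by printing the active Theano configuration, for example:

import theano

# both values are read from THEANO_FLAGS or ~/.theanorc at import time
print(theano.config.device)   # expected: gpu (or gpu0, gpu1, ...)
print(theano.config.floatX)   # expected: float32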

  1. Some ports are disabled because they are currently not supported by Theano. There are NotImplementedErrors and other errors in the tests about the numpy.ndarray object not being aligned. The developers commented on that, see here. And on some ports Theano's build flags -m32 resp. -m64 aren't supported by g++; the build flags can't be manipulated easily.
  2. Theano Development Team: "Theano: a Python framework for fast computation of mathematical expressions"
  3. Marc Couture: "Today's high-powered GPUs: strong for graphics and for maths". In: RTC magazine June 2015, pp. 22-25
  4. Ogier Maitre: "Understanding NVIDIA GPGPU hardware". In: Tsutsui/Collet (eds.): Massively parallel evolutionary computation on GPGPUs. Berlin, Heidelberg: Springer 2013, pp. 15-34
  5. Geoffrey French: "Deep learning tutorial: advanced techniques". PyData London 2016 presentation
  6. As the description of the Lintian tag binary-without-manpage says, that's not needed for them, since they are in /usr/share.
  7. Tom R. Halfhill: "Parallel processing with CUDA: Nvidia's high-performance computing platform uses massive multithreading". In: Microprocessor Report January 28, 2008
  8. Faber et al.: "Parallelwelten: GPU-Programmierung mit OpenCL". In: C't 26/2014, pp. 160-165
  9. For comparison, see: Valentine Sinitsyn: "Feel the taste of GPU programming". In: Linux Voice February 2015, pp. 106-109
  10. https://lists.debian.org/debian-devel/2016/07/msg00004.html
  11. If Optimus (hybrid) graphics hardware is present (as is common today on PC laptops), Debian launches the X-server on the graphics processing unit of the CPU, which is ideal for CUDA. The problem with Optimus actually is the graphics processing on the dedicated GPU. If you are using Bumblebee, the Python interpreter which you want to run Theano on has to be started with the launcher primusrun, because Bumblebee powers the GPU down with the tool bbswitch every time it isn't used, and I think the kernel module of the driver is also loaded and unloaded dynamically.
  12. Thorsten Leemhuis: "Treiberreviere. Probleme mit Grafiktreibern für Linux lösen". In: C't Nr.2/2013, pp. 156-161
  13. Martin Fischer: "4K-Rakete: Die schnellste Single-GPU-Grafikkarte der Welt". In C't 13/2015, pp. 60-61
  14. http://www.heise.de/developer/meldung/Nvidia-CUDA-8-bringt-Optimierungen-fuer-die-Pascal-Architektur-3164254.html
  15. Martin Fischer: "All In: Nvidia enthüllt die GPU-Architektur 'Pascal'". In: C't 9/2016, pp. 30-31
  16. Martin Fischer: "Turbo-Pascal: High-End-Grafikkarte für Spieler: GeForce GTX 1080". In: C't 13/2016, pp. 100-103
  17. http://www.golem.de/news/dgx-1-nvidias-supercomputerchen-mit-8x-tesla-p100-1604-120155.html

3 July 2016

Daniel Stender: My work for Debian in June

At least a little more time last month for helping to improve the operating system we're working on. I've worked on a couple of package updates, mostly for PAPT, DPMT, Debian Science, or pkg-go (in order of completion): New packages: Sponsored Uploads: Again some steps further ahead with these ones. My "new package of the month" is going to be Theano. I'm still working on the introduction; it'll follow very soon.

30 May 2016

Daniel Stender: My work for Debian in May

No double posting this time ;-) I haven't had that much spare time this month to spend on Debian, but I could work on the following packages: This series of blog postings also includes little introductions to new packages in the archive. This month there is:

Pyinfra

Pyinfra is a new project which is currently still in a development state. It has already been covered in an interesting German article1, and is now available as a package maintained within the Python Applications Team. It's currently a one-man production by Nick Barrett, and has been eagerly developed in the past weeks (we're currently at 0.1~dev24). Pyinfra is a remote server configuration/provisioning/service deployment tool which belongs in the same software category as Puppet or Ansible2. It's for provisioning one or an array of remote servers with software packages and for configuring them. Pyinfra runs agentless like Ansible; that means nothing special (like a daemon) has to run on the targeted servers to use it. It's written to be used for provisioning POSIX-compatible Linux systems and has alternatives when it comes to special features like package managers (e.g. it supports apt as well as yum). The documentation can be found in /usr/share/doc/pyinfra/html/. Here's a little crash course on how to use Pyinfra: the pyinfra CLI tool is used on the command line like this; deploy scripts, single operations or facts (see below) can be applied to a single server or a multitude of remote servers:
$ pyinfra -i <inventory script/single host> <deploy script>
$ pyinfra -i <inventory script/single host> --run <operation>
$ pyinfra -i <inventory script/single host> --facts <fact>
Remote servers which are operated on must provide a working shell and must be reachable by SSH. For connecting, the --port, --user, --password, --key/--key-password and --sudo flags are available, --sudo to gain superuser rights. Root access or sudo rights of course have to be set up already. By the way, localhost can be operated on the same way. Single operations are organized in modules like "apt", "files", "init", "server" etc. With the --run option they can be used individually on servers as follows; e.g. server.user adds a new user on a single targeted system (-v adds verbosity to the pyinfra run):
$ pyinfra -i 192.0.2.10 --run server.user sam --user root --key ~/.ssh/sshkey --key-password 123456 -v
Multiple servers can be grouped in inventories, which hold the targeted hosts and data associated with them, like e.g. an inventory file farm1.py would contain lists like this:
COMPUTE_SERVERS = ['192.0.2.10', '192.0.2.11']
DATABASE_SERVERS = ['192.0.2.20', '192.0.2.21']
Group designators must be all caps. A higher level of grouping is provided by the file names of the inventory scripts; thus COMPUTE_SERVERS and DATABASE_SERVERS can be referenced at the same time by the group designator farm1. Plus, all servers are automatically added to the group all. Inventory scripts should be stored in the subfolder inventory/ of the project directory. An inventory file can then be used instead of specific IP addresses like this; the single operation then gets performed on all machines given in farm1.py:
$ pyinfra -i inventory/farm1.py  --run server.user sam --user root --key ~/.ssh/sshkey --key-password=123456 -v
Deployment scripts can be used together with group data files in the subfolder group_data/ of the project directory. For example, group_data/farm1.py applies to all servers given in inventory/farm1.py (by the way, all.py applies to all servers), and contains the arbitrary attribute user_name (attributes must be lowercase), next to authentication data for the whole inventory group:
user_name = 'sam'
ssh_user = 'root'
ssh_key = '~/.ssh/sshkey'
ssh_key_password = '123456'
The attribute can be picked up by a deployment script using host.data as follows; here user_name is used again for e.g. server.user(), like this:
from pyinfra import host
from pyinfra.modules import server
server.user(host.data.user_name)
This deploy, the ensemble of inventory file, group data file and deployment script (usually placed at the top level of the project folder), can then be run this way:
$ pyinfra -i inventory/farm1.py deploy.py
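To recap the pieces so far, a slightly fuller deploy.py might look like the following. This is just a sketch composed from the calls shown in this post (server.user and the apt operation from the facts example further below), not an excerpt from the Pyinfra documentation:

from pyinfra import host
from pyinfra.modules import apt, server

# create the user whose name is defined in group_data/farm1.py
server.user(host.data.user_name)

# operations from the other modules can simply follow in the same script;
# here the apt module installs a package on all targeted hosts
apt.packages(packages='gummi', present=True, update=True)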
You've guessed it: since deployment scripts are Python scripts they are fully programmable (please note that Pyinfra is built for and runs on Python 3 on Debian), and that's the main advantage of this piece of software. Quite handy for this are Pyinfra facts, functions which check different things on remote systems and return the information as Python data. For example, deb_packages returns a dictionary of the installed packages on a remote apt-based server:
$ pyinfra -i 192.0.2.10 --fact deb_packages --user root --key ~/.ssh/sshkey --key-password=123456
{
    "192.0.2.10": {
        "libdebconfclient0": "0.192",
        "python-debian": "0.1.27",
        "libavahi-client3": "0.6.31-5",
        "dbus": "1.8.20-0+deb8u1",
        "libustr-1.0-1": "1.0.4-3+b2",
        "sed": "4.2.2-4+b1",
Using facts, Pyinfra reveals its full potential. For example, a deployment script could go like this; the linux_distribution fact returns a dict describing the installed distribution:
from pyinfra import host
from pyinfra.modules import apt
if host.fact.linux_distribution['name'] == 'Debian':
   apt.packages(packages='gummi', present=True, update=True)
elif host.fact.linux_distribution['name'] == 'CentOS':
   pass
I'll spare more sophisticated examples to keep this introduction simple. Beyond fancy deployment scripts, Pyinfra features its own API by which it can be programmed from the outside, and much more. But maybe that's enough to introduce Pyinfra; these are the usage basics. Pyinfra is a brand new project and it remains to be seen whether the developer can keep on developing the tool the way he does these days. For a private project it would be insane to attempt to become a contender for the established "big" free configuration management tools and frameworks, but whether Puppet has become too complex in the meantime or not3, I really don't think that's the point here. Pyinfra follows its own approach in being programmable the way it is. And it definitely doesn't hurt to have it in the toolbox already, without trying to replace anything.

Brainstorm

After a first package had been in experimental, the Brainstorm library from the Swiss AI research institute IDSIA4 is now available as python3-brainstorm in unstable. Brainstorm is a lean, easy-to-use library for setting up deep learning networks (multi-layered artificial neural networks) for machine learning applications like image and speech recognition or natural language processing. Setting up a working training network for a classifier of handwritten digits like the MNIST dataset (the usual "hello world") takes just a couple of lines, as an example demonstrates. The package is maintained within the Debian Python Modules Team. The Debian package ships a couple of examples in /usr/share/python3-brainstorm/examples (the data/ and examples/ folders of the upstream tarball are combined here). Among them there are5: The current documentation in /usr/share/doc/python3-brainstorm/html/ isn't complete yet (several chapters are under construction), but there's a walkthrough on the CIFAR-10 example. The MNIST example has been extended by Github user pinae, and has been explained in the German C't recently6. What are the perspectives for further development? As Zhou Mo confirmed, there are a couple of deep learning frameworks around with a rather poor outlook, since they have been abandoned after being completed as PhD projects. There's really no point in striving to have them all in Debian; the ITP of Minerva, for example, has been given up partly for this reason, as there haven't been any commits since 08/2015 (and because cuDNN isn't available and most likely won't be). Brainstorm, whose 0.5 release came out in 05/2015, also was a PhD project at IDSIA. It's stated on Github that the project is "under active development", but the rather sparse project page on the other hand expresses the "hope the community will help us to further improve Brainstorm". Such a sentence quite often implies that the developers are not actively working on a project. But there are recent commits, and it looks like upstream is active and can be reached when there are problems. So I don't think we're riding a dead horse here. The downside for Brainstorm in Debian is that the libraries needed for GPU-accelerated processing apparently can't be fully provided. Pycuda is available, but scikit-cuda (an additional library which provides wrappers for CUDA features like CUBLAS, CUFFT and CUSOLVER) is not and won't be, because the CULA Dense Toolkit (scikit-cuda contains wrappers for that, too) is not freely available as source. Because of that, a dependency on pycuda has been spared, not even as Suggests (it's non-free).
Without GPU acceleration, Brainstorm computes the matrices on OpenBLAS using a Cython wrapper around the NumpyHandler, and the PyCudaHandler can't be used. OpenBLAS makes pretty good use of the available hardware (it distributes the work over all available CPU cores), but it's not yet possible to run Brainstorm at full throttle using the available floating point devices to reduce training times, which becomes crucial when the projects get bigger. Brainstorm belongs to the number of deep learning frameworks already being or becoming available in Debian. Currently there is: I've checked out Microsoft's CNTK, but although it has also been set free recently, I have my doubts whether it could be included. Apparently there are dependencies on non-free software and most likely other issues. So much for a little update on the state of deep learning in Debian; please excuse it if my radar misses something.

  1. Tim Schürmann: "Schlangenöl: Automatisiertes Service-Deployment mit Pyinfra". In: IT-Administrator 05/2016, pp. 90-95.
  2. For a comparison of configuration management software like this, see B wetter/Johannsen/Steig: "Baukastensysteme: Konfigurationsmanagement mit Open-Source-Software". In: iX 04/2016, pp. 94-99 (please excuse the prevalence of German articles among the pointers, I just have them at hand).
  3. On the points of critique of Puppet, see Martin Loschwitz: "David gegen Goliath Zwei Welten treffen aufeinander: Puppet und Ansible". In: Linux-Magazin 01/2016, pp. 50-54.
  4. See the interview with IDSIA's deep learning guru Jürgen Schmidhuber in the German C't 2014/09, p. 148
  5. The example scripts need some more fine-tuning. The environment variable BRAINSTORM_DATA_DIR can be set for running the data creation scripts in place, but the trained networks are currently written in place. So please copy the scripts into some workspace if you want to try them out. I'll patch the example scripts to run out of the box, soon.
  6. Johannes Merkert: "Ziffernlerner. Ein künstliches neuronales Netz selber gebaut". In: C't 2016/06, pp. 142-147. Web: http://www.heise.de/ct/ausgabe/2016-6-Ein-kuenstliches-neuronales-Netz-selbst-gebaut-3118857.html.
  7. See Ramon Wartala: "Tiefenschärfe: Deep learning mit NVIDIAs Jetson-TX1-Board und dem Caffe-Framework". In: iX 06/2016, pp. 100-103
  8. https://lists.debian.org/debian-science/2016/03/msg00016.html

30 April 2016

Daniel Stender: My work for Debian in April

This month I've worked on these things for Debian: To begin with, I've set up a Debhelper sequencer script for dh-buildinfo1; this add-on can now be used with dh $@ --with buildinfo in debian/rules instead of having to call it explicitly somewhere in an override.

Debops

I've set up initial Debian packages of Debops2, a collection of finely crafted Ansible roles and playbooks especially for Debian servers (servers which run on Debian), which are shipped with a couple of helper and wrapper scripts in Python3. There are two binary packages, one for the toolset (debops), and the other for the playbooks and roles of the project (debops-playbooks). The application is easy to use: just initialize a new project with debops-init foo and add your server(s) to foo/ansible/inventory/hosts, assigned to groups representing the services and things you want to employ on them. For example, the group [debops_gitlab] automatically installs a complete running Gitlab setup on one or a multitude of servers in the same run with the debops command4. Other groups like [debops_mariadb_server] can be used accordingly in the same host inventory. Ansible works without an agent, so you don't have to prepare freshly set up servers with anything special before using the tool on them (the same goes for localhost). The list of things you can deploy with Debops is quite amazing, and dozens of services are at hand. The new Debian packages are currently in experimental because they need some more fine-tuning, e.g. there are a couple of minor error messages which currently show up when using it, but it works well. The (early-stage) documentation unfortunately couldn't be packaged because of the scattered resp. collective nature of the project (all parts have their own Github repositories)5, and how to generate the upstream tarball also remains a bit of a challenge (currently, it's the outcome of debops-init)6. I'll have this package in unstable soon. More info on Debops is coming up then.

HashiCorp's Packer

I'm very glad to announce that Packer7 is now available in unstable, and the RFP bug could finally be closed after I had taken it over8. It's another great and very convenient devops tool which does a lot of different things in an automated fashion, using only a single "one-argument" CLI tool in combination with a couple of lines in a configuration script (thanks to Yaroslav Halchenko for the tip). Packer helps with creating machine images for different platforms. Say you use Debian installations in a Qemu box for testing or development purposes: instead of setting up a new virtual machine manually, the same way as installing Debian on another computer, this process can be completely automated with Packer, like I've written about in this blog entry here9. You just need a template which contains instructions for the included Qemu builder and a preseeding script for the Debian installer, and there you go drinking your coffee while Packer does all the work: it downloads the ISO image for the installation, creates the new virtual hard drive, boots the emulator, runs the whole installation process automatically (answering questions, selecting things), reboots without the ISO image to complete the installation, etc. A couple of minutes and you have a new pre-baked virtual machine image as if from a vending machine; another fresh one can be created anytime.
Packer10 supports a number of builders for different target platforms (desktop virtualization solutions as well as public cloud providers and private cloud software), can build in parallel, and the full range of common provisioners can be employed in the process to equip the newly installed OSs with services and programs. Vagrant boxes can be generated by one of the included postprocessors. I'll write more on Packer here on this blog, soon. More than two dozen packages were missing to complete Packer11; getting them ready is the achievement of combined forces within the pkg-go group. Many thanks especially to Alexandre Viau, who has worked on most of the needed new packages. Thanks also to the FTP masters, who were always very quick in reviewing the Go packages, so that the dependent new ones could consecutively be built and packaged.

Squirrel3

I didn't do the major work on this one and just sponsored it for Fabian Wolff, but I want to highlight here that there's a new package of Squirrel12 now available in Debian13. Squirrel is a lightweight scripting language, somewhat comparable to Lua. It's fully object-oriented and highly embeddable; it's used under the hood in a lot of commercial computer games for implementing intelligence for bots next to other things14, but also for the Internet of Things (it's embedded in hardware from Electric Imp). Squirrel functions can be called from C++15. I've filed an ITP bug for Squirrel already in 2011 (#651195), but something else always had a higher priority, and it ended up being an RFP. I'm really glad that it got picked up and completed quickly afterwards.

misc

There were a couple of uploads of updated upstream tarballs and bug fixes, namely afl/2.10b-1 and 2.11b-1, python-afl/0.5.3-1, pyutilib/5.3.2-1, pyomo/4.3.11327-1, libvigraimpex/1.10.0+git20160211.167be93dfsg-2 (fix of #820429, thanks to Tobias Frost), and gamera/3.4.2+svn1454-1. For the pkg-go group, I've set up a new package of github-mitchellh-ioprogress (which is needed by the official DigitalOcean CLI tool doctl, now RFP #807956 instead of an ITP due to lack of time; again, a lot of packages are missing for that), and provided a little patch for dh-make-golang updating some standards16. For Packer I've also updated azure-go-autorest and azure-sdk as team uploads (#821938, #821832), but it turned out that the project, which is currently under heavy development towards a new official release, broke a lot in the past weeks (no Git branching has been used), so that Packer in fact needed a vendored snapshot, although there have been only a couple of commits in between. Docker-registry has the same problem with the new package of azure-sdk/2.1.1~beta1, so it needed to be fixed, too (#822146). By the way, the tool ratt17 comes in very handy for automatically test-building all reverse dependencies, not only for Go packages (thanks to Tianon Gravi for the tip). Finally, I've posted the needed reverse dependencies as RFP bugs for Terraform18 (again quite a lot), Vuls19, and cve-dictionary20, which is needed for Vuls. I'll let them rest for a while, waiting for them to get picked up, before working anything down myself.

19 August 2015

Patrick Schoenfeld: aptituz/ssh 2.3.2 published

I've just uploaded an updated version of my puppet ssh module to the forge. The module aims at being a generic module to manage ssh servers and clients, including key generation and known_hosts management. It provides a mechanism to generate and deploy ssh keys without the need for storeconfig or PuppetDB, using a server-side cache instead. This is neat if you want to retain ssh keys during a reprovisioning of a host.

Updates

The update is mostly to push out some patches I've received from contributors via pull requests in the last few months. It adds:
