Search Results: "Johannes Schauer"

08 August 2012

Johannes Schauer: Bootstrappable Debian - How to help

TLDR: multiarch, multiarch, multiarch, cross buildability, staged build dependencies, wiki page, corrections/hints/requests to debian-bootstrap at lists.mister-muffin.de

This summer (and this year's GSoC) is nearing its end. To make it easier for people to use the information my tools have produced so far, I created a page in the Debian wiki. It lists not only the open issues I see but also statistics that I gathered using the output of my GSoC project. I want to use this blog post to make people aware of that page and to get feedback on it and on anything related to it.

The biggest blocker my tools face is that many packages are still missing multiarch information. As long as at least the basic packages do not have their cross build dependencies satisfied via multiarch for an existing foreign architecture, automated tools cannot properly analyze the dependency situation in the bootstrapping case, where many packages of the new foreign architecture do not even exist yet. If Debian is supposed to be bootstrappable, then the first stage is to make a set of basic packages cross compile for an existing foreign architecture. Once this is possible, a tool of mine can analyze the cyclic build dependency situation that might occur when cross compiling for an architecture that does not exist yet. Then, staged cross builds can be used to cross compile a minimal foreign system. Because of the missing multiarch classification, it is not yet known how big the cyclic build dependency situation is for the base packages.

It is not only the conversion of packages to multiarch that is needed but also the addition of the :any (and in rare cases :native) qualifier to build dependencies on M-A: allowed packages. Prominent build dependencies that should be (but are not yet) M-A: allowed are python and gettext. Both are needed as a build dependency by many packages of the base system. Unfortunately wanna-build does not understand qualifiers like :any and :native yet. Until it does, no package can be marked :any or :native and cross compilation of many base packages cannot succeed.

Once the point is reached where a base system can be cross compiled from nothing, native compilation can start. Since native compilation doesn't depend on multiarch, the dependency situation when trying to natively compile all of Debian from nothing is understood much better. Unfortunately, the cyclic build dependency situation is also much worse in the native case: there exists one big strongly connected component of about 1000 nodes, made up of binary and source packages that all depend on each other. This dependency mess can be solved using three approaches: moving dependencies to Build-Depends-Indep, staged builds, and cross building. The wiki page gives many hints on how to find packages that each method can be applied to.

Stage building is a tool that might be useful for cross building (we don't know for sure yet) but is definitely needed for native compilation. It is needed for native compilation because, after all possible dependencies are moved to Build-Depends-Indep, the only other alternative to stage building for breaking dependency cycles is to cross build source packages. Since building a package "staged", without one of its build dependencies, is often much easier than making the package in question cross compile, it is the preferred alternative. Once more packages have been made multiarch, it might be possible to prove that there is no alternative to introducing a notion of staged builds. Some people (wookey, Patrick McDermott, Guillem Jover, myself) decided that the following format to mark staged build dependencies would be preferred over others:
Build-Depends: huge (>= 1.0) [i386 arm] <!embedded !bootstrap>, tiny
The <> format was proposed by Guillem Jover in bug#661538. Patches for dpkg and dose3 are done. More people need to discuss this format before a final decision can be made on how to indicate staged build dependencies. For more information on the topic, have a look at the corresponding wiki page. Feel free to direct any comments/critique/hints to debian-bootstrap at lists.mister-muffin.de or directly to me.

30 July 2012

Johannes Schauer: port bootstrap build-ordering tool report 4

A copy of this post is sent to soc-coordination@lists.alioth.debian.org as well as to debian-bootstrap@lists.mister-muffin.de.

Diary

July 2
  • playing around with syntactic dependency graphs and how to use them to flatten dependencies

July 4
  • make work with dose 3.0.2
  • add linux-amd64 to source architectures
  • remove printing in build_compile_rounds
  • catch Not_found exception and print warning
  • use the whole installation set in crosseverything.ml instead of flattened dependencies
  • detect infinite loop and quit in crosseverything.ml
  • use globbing in _tags file
  • use wildcards and patsubst in makefile

July 5
  • throw a warning if there exist binary packages without source packages
  • add string_of_list and string_of_pkglist and adapt print_pkg_list and print_pkg_list_full to use them
  • fix and extend flatten_deps - now also tested with Debian Sid

July 6
  • do not exclude the crosscompiled packages from being compiled in crosseverything.ml
  • clean up basebuildsystem.ml, remove old code, use BootstrapCommon code
  • clean up basenocycles.ml, remove unused code and commented out code
  • add option to print statistics about the generated dependency graph
  • implement most_needed_fast_wrong as well as most_needed_slow_correct and make both available through the menu

July 7
  • allow to investigate all scc, not only the full graph and the scc containing the investigated package
  • handle Not_found in src_list_from_bin_list with warning message
  • handle the event of the whole archive actually being buildable
  • replace raise Failure with failwith
  • handle incorrectly typed package names
  • add first version of reduced_dist.ml to create a self-contained mini distribution out of a big one

July 8
  • add script to quickly check for binary packages without source package
  • make Debian Sid default in makefile
  • add *.d.byte files to .gitignore
  • README is helpful now
  • more pattern matching and recursiveness everywhere

July 9
  • fix termination condition of reduced_dist.ml
  • have precise as default ubuntu distribution
  • do not allow to investigate an already installable package

July 10
  • milestone: show all cycles in a graph
  • add copyright info (LGPL3+)

July 11
  • advise use of dose tools in README

July 16
  • write apt_pkg based python filter script replacing grep-dctrl

July 17
  • use Depsolver.listcheck more often
  • add dist_graph.ml
  • refactor dependency graph code into its own module

July 18
  • improve package selection for reduced_dist.ml
  • improve performance of cycle enumeration code

July 20
  • implement buildprofile support into dose3

July 22
  • let dist_graph.ml use commandline arguments

July 23
  • allow dose3 to generate source package lists without Build-Depends-Indep and Build-Conflicts-Indep

July 29
  • implement crosscompile support into dose3

Results

Readme There is not yet a writeup on how everything works and how all the pieces of the code work together but the current README file provides a short introduction on how to use the tools.
  • build and runtime dependencies
  • compile instructions
  • execution examples for each program
  • step by step guide how to analyze the dependency situation
  • explanation of general commandline options
A detailed writeup about the inner workings of everything will be part of a final documentation stage.

License All my code is now released under the terms of the LGPL either version 3, or (at your option) any later version. A special linking exception is made to the license which can be read at the top of the provided COPYING file. The exception is necessary because Ocaml links statically, which means that without that exception, the conditions of distribution would basically equal GPL3+.

reduced_dist.ml The Debian archive in particular is huge, and one might want to work on a reduced selection of packages first. Working on a smaller selection of the archive is significantly faster and also avoids including thousands of packages that are not important for an extended base system. I call a reduced distribution a set of source packages A and a set of binary packages B which fulfill the following three properties:
  • all source packages A must be buildable with only binary packages B being available
  • all binary packages B except for architecture:all packages must be buildable from source packages A
The set of binary packages B and source packages A can be retrieved using the reduced_dist program. It can build either the most minimal reduced distribution or one that includes a certain package selection. To filter out the package control stanzas for a reduced distribution from a full distribution, I originally used a call to grep-dctrl but later replaced that with a custom python script called filter-packages.py. This script uses python-apt to filter Packages and Sources files for a certain package selection.
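The filtering step can be illustrated with a minimal sketch. It is not filter-packages.py: it parses the stanzas by hand instead of using python-apt so that it runs without extra dependencies, and the function names and the sample stanzas are made up for illustration.

```python
def parse_stanzas(text):
    """Split a Packages/Sources style control file into a list of dicts."""
    stanzas = []
    for block in text.strip().split("\n\n"):
        fields, key = {}, None
        for line in block.splitlines():
            if line[:1] in (" ", "\t") and key is not None:
                # continuation line of a multi-line field
                fields[key] += "\n" + line
            elif ":" in line:
                key, value = line.split(":", 1)
                fields[key] = value.strip()
        if fields:
            stanzas.append(fields)
    return stanzas


def filter_stanzas(text, wanted):
    """Keep only the stanzas whose Package field is in the set `wanted`."""
    return [s for s in parse_stanzas(text) if s.get("Package") in wanted]


# Hypothetical excerpt of a Packages file.
PACKAGES = """\
Package: libfoo1
Architecture: amd64
Depends: libc6

Package: foo-doc
Architecture: all

Package: bar
Architecture: amd64
Depends: libfoo1
"""

selection = {"libfoo1", "bar"}
for stanza in filter_stanzas(PACKAGES, selection):
    print(stanza["Package"], stanza["Architecture"])
```

For real archives a proper parser such as the one in python-apt is the safer choice, since it deals with compressed files and unusual field layouts.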

dist_graph.ml It soon became obvious that there were not many independent dependency cycle situations but just one big scc that contains 96% of the packages that are involved in build dependency cycles. Therefore it made sense to write a program that does not iteratively build the dependency graph starting from a single package, but which builds a dependency graph for a whole archive.

Cycles I can now enumerate all cycles in the dependency graph. I covered the theoretical part in another blog post and wrote an email about the achievement to the list. Both resources contain more links to the respective sourcecode. The dependency graph generated for Debian Sid has 39486 vertices. It has only one central scc with 1027 vertices and only eight other scc with 2 to 7 vertices. All the other source and binary packages in the dependency graph for the archive are degenerate components of size one. Obtaining the attached result took 4 hours on my machine (Core i5 @ 2.53GHz). 1.5 hours of that were needed to build the dependency graph, the other 2.5 hours were needed to run Johnson's algorithm on the result. Memory consumption of the program was about 700 MB. I am glad that both the runtime of the cycle finding algorithm for a whole Debian Sid repository and its memory requirements are within orders of magnitude that are justifiable on off-the-shelf hardware. It must also be noted that nothing is optimized for performance yet. A list of all cycles in Debian Sid up to length 4 can be retrieved from this email. This cycle analysis assumes that only essential packages, build-essential with its dependencies, and debhelper are available. Debhelper is not an essential or build-essential package but 79% of the archive build-depends on it. The most interesting cycles are probably those of length 2 that need packages that they build themselves. Notable examples of this situation are vala, python, mlton, fpc, sbcl and ghc. Languages seem to love needing themselves to be built.
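To make the scc statistics above more tangible, here is a small sketch that computes strongly connected components with Tarjan's scc algorithm (not to be confused with his circuit enumeration algorithm). It is not the code used for the numbers above; the toy graph and all package names in it are invented, and an edge simply means "needs".

```python
def strongly_connected_components(graph):
    """Tarjan's scc algorithm; `graph` maps each vertex to its successors."""
    index = {}      # discovery order of each visited vertex
    lowlink = {}    # smallest index reachable from the vertex's subtree
    stack, on_stack = [], set()
    components = []

    def visit(v):
        index[v] = lowlink[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of a component: pop the component off the stack
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components


# Hypothetical mini archive: source packages need binary packages,
# binary packages need the source package they are built from.
graph = {
    "src:a": ["libb-dev"],
    "libb-dev": ["src:b"],
    "src:b": ["liba-dev"],
    "liba-dev": ["src:a"],
    "src:c": ["liba-dev"],
    "libc1": ["src:c"],
}

components = strongly_connected_components(graph)
print(sorted([len(c) for c in components], reverse=True))
```

The recursive formulation is fine for a toy graph. For a graph with tens of thousands of vertices an iterative variant would be needed in Python because of the recursion limit.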

Buildprofiles There is a long discussion of how to encode staged build dependency information in source packages. While the initial idea was to use Build-Depends-StageN fields, this solution would duplicate large parts of the Build-Depends field, which leads to bitrot and is inflexible with regard to possible other build "profiles". To remedy the situation it was proposed to use field names like Build-Depends[stage1 embedded], but this would also duplicate information and would break the rfc822 format of package description files. A document maintained by Guillem Jover gives even more ideas and details. Internally, Patrick and I decided on another idea of Guillem Jover's to annotate staged build dependencies. The format reads like:
Build-Depends: huge (>= 1.0) [i386 arm] <!embedded !bootstrap>, tiny
So each build profile would follow a dependency in <> "brackets" and have a similar format to architecture options. Patrick has a patch for dpkg that implements this functionality while I patched dose3.
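How a tool might read this annotation can be sketched as follows. This is neither the dpkg nor the dose3 patch. The evaluation rule is an assumption on my part: all terms inside <> must hold, so a dependency marked <!embedded !bootstrap> is dropped as soon as one of these profiles is active. Alternatives separated by | are not handled.

```python
import re

# One dependency: name, optional (version), optional [archs], optional <profiles>
DEP_RE = re.compile(
    r"^\s*(?P<name>[a-z0-9][a-z0-9+.\-]*)"
    r"(?:\s*\((?P<version>[^)]*)\))?"
    r"(?:\s*\[(?P<archs>[^\]]*)\])?"
    r"(?:\s*<(?P<profiles>[^>]*)>)?"
    r"\s*$"
)


def parse_build_depends(field):
    """Parse a comma separated Build-Depends value into a list of dicts."""
    deps = []
    for item in field.split(","):
        match = DEP_RE.match(item)
        if match is None:
            raise ValueError("cannot parse dependency: %r" % item)
        deps.append({
            "name": match.group("name"),
            "version": match.group("version"),
            "archs": (match.group("archs") or "").split(),
            "profiles": (match.group("profiles") or "").split(),
        })
    return deps


def applies(dep, arch, active_profiles=()):
    """Decide whether a dependency is needed for this arch and profile set."""
    archs = dep["archs"]
    if archs:
        negated = [a[1:] for a in archs if a.startswith("!")]
        positive = [a for a in archs if not a.startswith("!")]
        if arch in negated or (positive and arch not in positive):
            return False
    for term in dep["profiles"]:
        if term.startswith("!"):
            if term[1:] in active_profiles:
                return False      # assumed rule: negated profile is active
        elif term not in active_profiles:
            return False          # assumed rule: required profile is missing
    return True


FIELD = "huge (>= 1.0) [i386 arm] <!embedded !bootstrap>, tiny"
deps = parse_build_depends(FIELD)
print([d["name"] for d in deps if applies(d, "i386")])
print([d["name"] for d in deps if applies(d, "i386", {"bootstrap"})])
```

With no profile active, both huge and tiny are needed on i386. With the bootstrap profile active, only tiny remains, which is exactly what a staged build wants.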

Dropping Build-Depends-Indep and Build-Conflicts-Indep When representing the dependencies of a source package, dose3 concatenates its Build-Depends and Build-Depends-Indep dependency information. So up to now, a source package could only be compiled if it was possible to build all of its binary packages, including architecture:all packages. But when bootstrapping a new architecture, it should be sufficient to build only the architecture dependent packages, and therefore to run only the build-arch target in debian/rules and not the build-indep target. Considering only the Build-Depends field and dismissing the Build-Depends-Indep field reduced the main scc from 1027 vertices to 979 vertices. The number of cycles up to length four dropped from 276 to 206. Especially the cycles containing gtk-doc-tools, doxygen, debiandoc-sgml and texlive-latex-base became far fewer. Patrick managed to add a Build-Depends-Indep field to four packages so far, which reduced the scc further by 14 vertices down to 965 vertices. So besides staged build dependencies and cross building there is now a third method that can be applied to break dependency cycles: add Build-Depends-Indep information to packages or update existing information. I submitted a list of packages that have a binary-indep and/or a build-indep target in their debian/rules to the list. I also submitted a patch for dose3 that makes it possible to ignore Build-Depends-Indep and Build-Conflicts-Indep information.
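The difference between the two views of a source package can be shown with a few lines. This is only an illustration, not the dose3 patch: the function name, the arch_only switch and the sample stanza are hypothetical, and version constraints or alternatives are not parsed.

```python
def build_dependencies(stanza, arch_only=False):
    """Return the build dependency names of one Sources stanza (a dict)."""
    fields = ["Build-Depends"]
    if not arch_only:
        # default view: concatenate both fields
        fields.append("Build-Depends-Indep")
    deps = []
    for field in fields:
        for item in stanza.get(field, "").split(","):
            if item.strip():
                deps.append(item.strip())
    return deps


# Hypothetical source package.
STANZA = {
    "Package": "foo",
    "Build-Depends": "debhelper, libbar-dev",
    "Build-Depends-Indep": "doxygen, texlive-latex-base",
}

print(build_dependencies(STANZA))
print(build_dependencies(STANZA, arch_only=True))
```

In the second view, doxygen and texlive-latex-base no longer contribute edges to the dependency graph, so any cycle that only existed through them disappears.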

Dose3 crossbuilding So far I only looked at dependency situations in the native case. While the native case contains a huge scc of about 1000 packages, the dependency situation will be much nicer when cross building. But dose3 was so far not able to simulate cross building of source packages. I wrote a patch that implements this functionality and will allow me to write programs that help analyze the cross-situation as well.

Debconf Presentation Wookey was giving a talk at debconf12 for which I was supplying him with slides. The slides in their final version can be downloaded here

Future Patrick maintains a list of "weak" build dependencies. Those are dependencies that are very likely to be droppable in either a staged build or using Build-Depends-Indep. I must make use of this list to make it easier to find packages whose dependencies can easily be removed. I will have to implement support for resolving the main scc using staged build dependencies. Since it is unlikely that Patrick will be fast enough in supplying me with modified packages, I will need to create a database of dummy packages myself. Another open task is to allow analyzing the crossbuilding dependency situation. What I'm currently more or less waiting on is the inclusion of my patches into dose3 as well as a decision on the buildprofile format. More people need to discuss it before it can be included into tools as well as policy. Every maintainer of a package can help make bootstrapping easier by making sure that as many dependencies as possible are part of the Build-Depends-Indep field.

04 July 2012

Johannes Schauer: enumerating elementary circuits of a directed_graph

For my GSoC project this year I need to be able to enumerate all elementary circuits of a directed graph. My code is written in Ocaml but neither the ocamlgraph library nor graph libraries for other languages seem to implement a well tested algorithm for this task. Lacking such a well tested solution to the problem, I decided to implement a couple of different algorithms. Since it is unlikely that different algorithms yield the same wrong result, I can be certain enough that each individual algorithm is working correctly if they all agree on a single solution. As a result I wrote a testsuite containing an unholy mixture of Python, Ocaml, D and Java code which implements algorithms by D. B. Johnson, R. Tarjan, K. A. Hawick and H. A. James.

Algorithm by R. Tarjan The earliest algorithm that I included was published by R. Tarjan in 1973.
Enumeration of the elementary circuits of a directed graph
R. Tarjan, SIAM Journal on Computing, 2 (1973), pp. 211-216
http://dx.doi.org/10.1137/0202017
I implemented the pseudocode given in the paper using Python. The git repository can be found here: https://github.com/josch/cycles_tarjan

Algorithm by D. B. Johnson The algorithm by D. B. Johnson from 1975 improves on the complexity of Tarjan's algorithm.
Finding all the elementary circuits of a directed graph.
D. B. Johnson, SIAM Journal on Computing 4, no. 1, 77-84, 1975.
http://dx.doi.org/10.1137/0204007
In the worst case, Tarjan's algorithm has a time complexity of O(n e(c+1)) whereas Johnson's algorithm supposedly manages to stay in O((n+e)(c+1)) where n is the number of vertices, e is the number of edges and c is the number of cycles in the graph. I found two implementations of Johnson's algorithm. One was done by Frank Meyer and can be downloaded as a zip archive. The other was done by Pietro Abate and the code can be found in a blog entry which also points to a git repository. The implementation by Frank Meyer seemed to work flawlessly. I only had to add code so that a graph could be given via commandline. The git repository of my additions can be found here: https://github.com/josch/cycles_johnson_meyer Pietro Abate implemented an iterative and a functional version of Johnson's algorithm. It turned out that both yielded incorrect results as some cycles were missing from the output. A fixed version can be found in this git repository: https://github.com/josch/cycles_johnson_abate

Algorithm by K. A. Hawick and H. A. James The algorithm by K. A. Hawick and H. A. James from 2008 improves further on Johnson's algorithm and does away with its limitations.
Enumerating Circuits and Loops in Graphs with Self-Arcs and Multiple-Arcs.
K. A. Hawick and H. A. James, In Proceedings of FCS, 2008, pp. 14-20
www.massey.ac.nz/~kahawick/cstn/013/cstn-013.pdf
In contrast to Johnson's algorithm, the algorithm by K. A. Hawick and H. A. James is able to handle graphs containing edges that start and end at the same vertex as well as multiple edges connecting the same two vertices. I do not need this functionality but added the code as additional verification. The paper contains extensive code snippets written in the D programming language. A full, working version with all pieces connected together can be found here: https://github.com/josch/cycles_hawick_james The algorithm was verified using example output given in the paper. The project README states how to reproduce it.

Input format All four codebases have been modified to produce executables that take the same commandline arguments, which describe the graph to investigate. The first argument is the number of vertices of the graph. Subsequent arguments are ordered pairs of comma separated vertices that make up the directed edges of the graph. Let's look at the following graph as an example. The DOT source for this graph is:
digraph G {
  0;
  1;
  2;
  0 -> 1;
  0 -> 2;
  1 -> 0;
  2 -> 0;
  2 -> 1;
}
To generate the list of elementary circuits using Tarjan's algorithm for the graph above, use:
$ python cycles.py 3 0,1 0,2 1,0 2,0 2,1
0 1
0 2
0 2 1
The commandline arguments are the exact same for the other three methods and yield the same result in the same order. If the DOT graph is in a format as simple as above, the following sed construct can be used to generate the commandline argument that represents the graph:
$ echo $(sed -n -e '/^\s*[0-9]\+;$/p' graph.dot | wc -l) $(sed -n -e 's/^\s*\([0-9]\+\) -> \([0-9]\+\);$/\1,\2/p' graph.dot)
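The input format and the example output above can be reproduced with a short sketch. It is deliberately not one of the four tested implementations but a plain backtracking search that roots every circuit at its smallest vertex. It assumes a graph without multiple edges (a duplicated edge would make it report a circuit twice), and the function names are my own.

```python
def parse_graph(args):
    """Turn ['3', '0,1', '0,2', ...] into an adjacency list."""
    n = int(args[0])
    adjacency = [[] for _ in range(n)]
    for pair in args[1:]:
        source, target = (int(x) for x in pair.split(","))
        adjacency[source].append(target)
    return adjacency


def elementary_circuits(adjacency):
    """Yield every elementary circuit once, rooted at its smallest vertex."""
    for start in range(len(adjacency)):
        path = [start]
        on_path = {start}

        def extend(v):
            for w in sorted(adjacency[v]):
                if w == start:
                    # the path closes back to its root: one circuit found
                    yield list(path)
                elif w > start and w not in on_path:
                    path.append(w)
                    on_path.add(w)
                    yield from extend(w)
                    on_path.discard(w)
                    path.pop()

        yield from extend(start)


args = "3 0,1 0,2 1,0 2,0 2,1".split()
for circuit in elementary_circuits(parse_graph(args)):
    print(" ".join(str(v) for v in circuit))
```

For the example graph this prints 0 1, 0 2 and 0 2 1, one circuit per line. The search has none of the pruning that Tarjan's or Johnson's algorithm add, so it is only meant for small graphs.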

Testsuite As all four codebases take the same input format and have the same output format, it is now trivial to write a testsuite that compares the individual output of each algorithm for the same input and checks for differences. The code of the testsuite is available via this git repository: https://github.com/josch/cycle_test The other four repositories exist as submodules of the testsuite repository.
$ git clone git://github.com/josch/cycle_test.git
$ cd cycle_test
$ git submodule update --init
A testrun is done via calling:
$ ./test.sh 11
The argument to the shell script is an integer denoting the maximum number N of vertices for which graphs will be generated. The script will compile the Ocaml, Java and D sourcecode of the submodules as well as an Ocaml program called rand_graph.ml, which generates random graphs with v = 1..N vertices, where N is given as a commandline argument. For each number of vertices v, graphs with e = 1..M edges are generated, where M is the maximum number of edges possible for that number of vertices. For every combination of v and e, a graph is randomly generated using Pack.Digraph.Rand.graph from the ocamlgraph library. Each generated graph is checked for cycles and written to a file graph-v-e.dot if it contains a cycle. test.sh then loops over all generated dot files. It uses the above sed expression to convert each dot file to commandline arguments for each of the algorithms. The outputs of the algorithms are then compared with each other, and the loop continues only if they do not differ. Otherwise the script exits with an error. The tested algorithms are the Python implementation of Tarjan's algorithm, the iterative and functional Ocaml implementations as well as the Java implementation of Johnson's algorithm, and the D implementation of the algorithm by Hawick and James.
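The idea behind the testsuite, comparing independent implementations on many random graphs, can be sketched in a single file. This is not test.sh: instead of the four real codebases it compares two toy enumerators (a backtracking search and a brute force over vertex orderings), the random graphs come from Python's random module rather than ocamlgraph, and graphs without cycles are checked too.

```python
import itertools
import random


def circuits_dfs(n, edges):
    """Backtracking search, each circuit rooted at its smallest vertex."""
    succ = {v: sorted(w for (u, w) in edges if u == v) for v in range(n)}
    found = []

    def extend(start, path):
        for w in succ[path[-1]]:
            if w == start:
                found.append(tuple(path))
            elif w > start and w not in path:
                extend(start, path + [w])

    for start in range(n):
        extend(start, [start])
    return sorted(found)


def circuits_bruteforce(n, edges):
    """Try every ordered vertex selection and keep those forming a circuit."""
    edge_set = set(edges)
    found = []
    for k in range(1, n + 1):
        for perm in itertools.permutations(range(n), k):
            if perm[0] != min(perm):
                continue  # count each circuit only in its canonical rotation
            hops = zip(perm, perm[1:] + perm[:1])
            if all(hop in edge_set for hop in hops):
                found.append(perm)
    return sorted(found)


def random_graph(rng, n, e):
    """Pick e distinct directed edges without self-loops."""
    possible = [(u, w) for u in range(n) for w in range(n) if u != w]
    return rng.sample(possible, e)


def run_tests(max_vertices, seed=0):
    """Compare both enumerators for every vertex/edge count combination."""
    rng = random.Random(seed)
    checked = 0
    for n in range(1, max_vertices + 1):
        for e in range(1, n * (n - 1) + 1):
            edges = random_graph(rng, n, e)
            if circuits_dfs(n, edges) != circuits_bruteforce(n, edges):
                raise AssertionError("mismatch for n=%d edges=%r" % (n, edges))
            checked += 1
    return checked


print(run_tests(5), "graphs checked, both enumerators agree")
```

As in the real testsuite, agreement does not prove correctness, but it is unlikely that two unrelated approaches make the same mistake on every input.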

Future developments Running the testsuite with a maximum of 12 vertices takes about 33 minutes on a 2.53GHz Core2Duo and produces graphs with as many as 1.1 million cycles. It seems that all five implementations agree on the output for all 504 generated graphs that were used as input. If there should be another implementation of an algorithm that enumerates all elementary circuits of a directed graph, I would like to add it. There are some more papers that I would like to read but I lack access to epubs.siam.org and ieeexplore.ieee.org and would have to buy them. Benchmarks seem a bit pointless as not only are the algorithms very different from each other (and there are many ways to tweak each of them) but the programming languages differ as well. Still, for the curious, here is the time each algorithm takes to enumerate all cycles for all generated graphs up to 11 vertices.
algorithm                            time (s)
Johnson, Abate, Ocaml, iterative     137
Johnson, Abate, Ocaml, functional    139
Tarjan, Python                       153
Hawick, D                            175
Johnson, Meyer, Java                 357
The iterative Ocaml code performs as well as the functional one. In practice, the iterative code should probably be preferred as the functional code is not tail recursive. On the other hand it is also unlikely that cycles ever grow big enough to make a difference in the used stack space. The Python implementation executes surprisingly fast, given that Tarjan's algorithm is supposedly inferior to Johnson's and that Python is interpreted, but the Python implementation is also the simplest one and requires the fewest datastructures. The D code potentially suffers from the bigger datastructures and other bookkeeping that is required to support multi and self arcs. The Java code implements a whole graph library, which might explain some of its slowness.

02 July 2012

Johannes Schauer: port bootstrap build-ordering tool report 3

A copy of this post is sent to soc-coordination@lists.alioth.debian.org as well as to debian-bootstrap@lists.mister-muffin.de.

Diary

June 18 Pietro suggests a faster way to generate installation sets for a list of packages. In my case, I need an installation set for every source package in the archive to find out how often a binary package is needed to build a source package. As a result, the speed is doubled in contrast to the original approach.

June 19
  • adapt code to work with new dose release 3.0
  • remove unneeded parts of code
  • add different possibilities to find amount of source packages that need a binary package
  • add code to get multiple installation sets using Depsolver_int.solve

June 20
  • add ~global_constraints:false to Depsolver.listcheck, Depsolver.trim and Depsolver.edos_install calls
  • adapt output graph to limited xdot capabilities

June 21 I write an email to the list, reporting on the dependency graphs of debhelper, cdbs, pkg-config and libgtk2.0-dev. My current technique gets an installation set for a source package, removes all packages that are already installable and adds the others as dependencies of that source package. Each of these dependencies in turn includes its own installation set, again minus all packages that are already available. The problem with that approach is that long dependency chains create additional dependency cycles. Example: src:A needs B needs C needs A. B and C would both be added as dependencies of src:A. B as well as C would also include their installation sets, which in both cases include A. So now there are two cycles: src:A->B->A and src:A->C->A. For a real life example, look at the following situation of cdbs and src:sqlite3. [Figure: cdbs old situation] It is created because src:sqlite3 needs cdbs needs python-scour needs python needs python2.7 needs libsqlite3-0. Therefore libsqlite3-0 is in the installation set of cdbs, python-scour, python and python2.7. This creates five cycles in the graph even though there is only one. It would be better to reduce the dependencies added to src:sqlite3 to its direct dependency, which is cdbs. Package dependencies are disjunctions from which the solver chooses one or the other to build an installation set. To solve the problem above I would need to know which disjunction the solver chose and then only add the direct dependency of a package to the dependency graph.
  • improve build_compile_rounds performance
  • big overhaul of menu structure
  • fix subgraph extraction code

June 22
  • do not create a universe if not needed - use hashtables instead
  • for sorting packages, generating difference of package sets and otherwise comparing packages, always use CudfAdd.compare
  • as a custom list membership function, use List.exists instead of trying List.find
  • more speedup for build_compile_rounds
  • the number of source packages that can be built does NOT include the cross built packages
  • print closure members in graph
  • refactor code and move common functions to bootstrapCommon.ml
  • add breakcycles.ml for future code to break cycles using staged build dependencies
  • use more extlib functionality
  • extended package list input format

June 23 After several emails with Pietro I learn about syntactic dependency graphs. To document my progress and prove my efforts I committed the code as commit 6684c13. But this code was soon found to be unnecessary, so it will be removed later and just serves as documentation.

June 24 I came up with another (better?) solution to get the chosen disjunctions. It simply uses the calculated installation set to decide for each disjunction which alternative was taken by the solver. I reported that important step and the open questions involved with it in an email to the list. The problem always was that an installation set can easily contain more than one package of a disjunction. In this case it is not clear which branch was chosen by the solver. I found that in Ubuntu Natty there are only 6 such packages and that for each of them the situation can be resolved. It can be resolved because in all of those cases either one package of a disjunction provides the other, or both packages depend upon each other, which means that both have to be included.
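The approach can be sketched in a few lines. This is an illustration and not the code in my repository: dependencies are given as a list of disjunctions, virtual packages and Provides are ignored, and all package names in the example are made up.

```python
def alternatives_taken(depends, installation_set):
    """For each disjunction, list the alternatives present in the installation set."""
    return [[alt for alt in disjunction if alt in installation_set]
            for disjunction in depends]


def split_by_ambiguity(depends, installation_set):
    """Separate clearly chosen alternatives from disjunctions that stay ambiguous."""
    chosen, ambiguous = [], []
    for present in alternatives_taken(depends, installation_set):
        if len(present) == 1:
            chosen.append(present[0])
        elif len(present) > 1:
            # more than one alternative is installed: the branch taken
            # by the solver cannot be read off the installation set alone
            ambiguous.append(present)
    return chosen, ambiguous


# Hypothetical build dependencies: each inner list is one disjunction (a | b).
depends = [["debhelper"], ["libdb-dev", "libdb5.1-dev"], ["python", "python3"]]
installation_set = {"debhelper", "libdb-dev", "libdb5.1-dev", "python", "python2.7"}

chosen, ambiguous = split_by_ambiguity(depends, installation_set)
print("direct dependencies:", chosen)
print("ambiguous disjunctions:", ambiguous)
```

The ambiguous cases are the ones that need the extra reasoning described above, for example checking whether one alternative provides or depends on the other.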

June 27
  • use installation set to flatten build dependencies of source packages
  • refactor code and move common functions to bootstrapCommon.ml

June 25 I have to have an algorithm that finds all circuits in a given graph. This is necessary so that:
  1. cycles can be enumerated for huge dependency graphs where cycles are hard to see
  2. cycles can be enumerated to find a cycle that can be broken using staged build dependencies
It seems that Johnson's algorithm is the best way to do this in terms of complexity, and Pietro already blogged about the problem together with an implementation of the algorithm in Ocaml. Unfortunately it turns out that his code doesn't implement the algorithm correctly and hence misses some cycles. The fix does not seem to be trivial, so I'm still investigating it.

June 28
  • add crosseverything.ml to obtain a list of source packages that, if cross compiled, would make the whole archive available

Results While the first week was productive as usual, I had to work some time on a University project during the second week as well as attend a family meeting. I will catch up with the lost time over the course of the next week.

dose3 Using dose 3.0 (which fixes a bug about essential packages) the output of the algorithms is now likely less wrong than before.

performance Performance was improved in the generation of installation sets as well as in the code that tries out how many packages can be built in multiple rounds. This was achieved by more caching, fewer unnecessary operations in looping constructs, replacing lists with hashtables, and not creating universes where they are not necessary.

user interface The main program, basenocycles.ml now has a much better menu structure.

input format The programs take two package file inputs: the list of source packages that have to be cross built for a minimal build system, and the list of source packages that were chosen to be cross compiled in addition to that. Both files list one source package per line and now allow comments.

refactoring As more and more scripts are added, more and more functionality is moved to bootstrapCommon.ml which makes each script much cleaner.

what to test for cross building As discussed in the "Future" section of the last report, I now automated the process of finding out which packages, if they were cross compiled, would make the whole archive available because they break all cycles and allow native compilation of the rest. The outcome: to build 3333 out of 3339 packages in natty natively, at most 186 source packages must be cross compiled. The other six cannot be compiled because of version mismatches in the Natty Sources.bz2 and Packages.bz2. The code can be run from crosseverything.ml.
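The "build in multiple rounds" idea behind this analysis can be sketched as a fixed point computation. It is a strong simplification of what basenocycles.ml and crosseverything.ml do: build dependencies are flat sets of binary package names, no solver is involved, and the toy archive is invented.

```python
def build_in_rounds(sources, available):
    """Repeatedly build every source package whose dependencies are available.

    `sources` maps a source name to (build dependencies, produced binaries).
    Returns the list of rounds and the source packages that stay unbuildable.
    """
    available = set(available)
    remaining = dict(sources)
    rounds = []
    while remaining:
        buildable = sorted(name for name, (deps, _) in remaining.items()
                           if deps <= available)
        if not buildable:
            break  # everything left is stuck in a dependency cycle
        for name in buildable:
            available |= remaining.pop(name)[1]
        rounds.append(buildable)
    return rounds, sorted(remaining)


# Hypothetical archive: src:c and src:d need each other's binaries.
SOURCES = {
    "src:a": ({"gcc"}, {"liba"}),
    "src:b": ({"gcc", "liba"}, {"libb"}),
    "src:c": ({"libd"}, {"libc"}),
    "src:d": ({"libc"}, {"libd"}),
}

print(build_in_rounds(SOURCES, {"gcc"}))
# Pretend src:c was cross compiled, so libc is available from the start.
print(build_in_rounds(SOURCES, {"gcc", "libc"}))
```

In the first call src:c and src:d remain stuck. Making libc available, as a cross compiled src:c would, unblocks the cycle and everything can be built natively in two rounds.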

limit source dependencies to direct dependencies Reducing the dependencies of source packages from their full installation set to their direct dependencies, by finding out which disjunctions of their dependency list were taken, greatly simplifies the dependency graphs. The dependency graph created for libgtk2.0-dev could be reduced from 491 to 247 vertices for a depth of three. For cdbs it is now clearly visible that cdbs depends on libsqlite3-0 which builds from src:sqlite3 which depends on cdbs. Before: [Figure: cdbs old situation] After: [Figure: cdbs new situation] For pkg-config the graph also has been reduced to the one single cycle that matters: src:pkg-config needs libglib2.0-dev which depends on pkg-config which builds from src:pkg-config. Before: [Figure: pkg-config old situation] After: [Figure: pkg-config old situation]

Future I will prepare content for wookey's debconf talk on crossbuilding and bootstrapping. As this will include directions how to use the current code, I will kill two birds with one stone and write some proper documentation for my current source. The following two lists will be displayed after a dependency graph is calculated and reduced to its scc:
  • those source packages that have the least build dependencies not fulfilled. Those might be candidates for easy staged build dependencies. Since the source package is part of the scc, it will definitely be involved in some cycle somewhere.
  • those binary packages that most source packages depend upon. Those could be candidates for cross compilation as it might be easier to cross compile the source package than using staged build dependencies.
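The two lists can be computed along the following lines. This is only a sketch under simplifying assumptions: each source package comes with a flat set of binary packages it needs, `available` holds what is already installable, and the package names are placeholders.

```python
from collections import Counter


def rank_candidates(build_deps, available):
    """Return (sources by fewest unmet deps, binaries by number of dependents)."""
    unmet = {src: deps - available for src, deps in build_deps.items()}
    # candidates for staged builds: few dependencies left to drop
    fewest_unmet = sorted(unmet, key=lambda src: (len(unmet[src]), src))
    # candidates for cross compilation: needed by many source packages
    needed_by = Counter(binary for deps in unmet.values() for binary in deps)
    most_needed = sorted(needed_by, key=lambda b: (-needed_by[b], b))
    return ([(src, sorted(unmet[src])) for src in fewest_unmet],
            [(binary, needed_by[binary]) for binary in most_needed])


# Hypothetical source packages that are all part of one scc.
BUILD_DEPS = {
    "src:x": {"liby-dev", "debhelper"},
    "src:y": {"libx-dev", "libz-dev", "debhelper"},
    "src:z": {"libx-dev", "debhelper"},
}

staged, cross = rank_candidates(BUILD_DEPS, {"debhelper"})
print(staged)
print(cross)
```

In this toy example libx-dev is needed by two source packages, so the source package building it would be the first one to try cross compiling.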
Patrick has now managed to cross build packages with sbuild. So the list of packages that crosseverything.ml produces can now be checked efficiently for cross buildability. With this list, potentially more cycles can be broken out of the box. A feature will be added that allows the user to remove all packages from a dependency graph that can be cross compiled without any additional effort. Version mismatches between source and binary packages in Sources.bz2 and Packages.bz2 respectively in Ubuntu make the scripts fail and/or produce wrong results. Debian (even Sid) doesn't have this problem so I should find out where to report this problem to Ubuntu. I need to write a working version of Johnson's algorithm because much functionality depends upon it. I have the option to improve Pietro's version or write one from scratch. Writing one from scratch might be easier as I have Pietro's code as a template as well as a Java implementation of Johnson's algorithm which seems to work. The following functionalities need working cycle enumeration:
  • given source packages with staged build dependencies, an enumeration of cycles is needed to find out which cycles can be broken by building packages staged. It makes less sense to blindly build a package staged and then check if this makes more packages available.
  • display cycles of a dependency graph to the user. After obtaining all cycles in the graph it makes sense to sort them by their length. The user would then investigate the situation of the smallest cycles first. This makes sense because breaking small cycles can potentially break bigger cycles. Since in the end, all cycles have to be eliminated anyway, it makes sense for the user to first tackle the small ones.
  • display the feedback arc set to the user. The packages in the feedback arc set might be very good candidates for reduced build dependencies or cross compilation.
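To illustrate the second item, here is a minimal Python 3 sketch of enumerating cycles and sorting them by length. It is not Johnson's algorithm and not the OCaml code from the repository: it is a naive depth-first search that is only practical for small graphs. The function name simple_cycles and the dictionary layout are made up for this example. The cdbs and pkg-config cycles are the ones described above, while src:x and x are hypothetical packages added to get a shorter cycle.

```python
def simple_cycles(graph):
    """Yield every elementary cycle of a directed graph.

    graph maps a vertex to a list of its successors. Each cycle is
    reported exactly once, starting at its smallest vertex.
    Naive depth-first search, NOT Johnson's algorithm.
    """
    order = {v: i for i, v in enumerate(sorted(graph))}

    def extend(start, path, on_path):
        for succ in graph.get(path[-1], ()):
            if succ == start:
                # closed a cycle back to the start vertex
                yield list(path)
            elif (succ in order and order[succ] > order[start]
                    and succ not in on_path):
                # only visit vertices larger than start, so a cycle
                # is never found again from one of its other members
                path.append(succ)
                on_path.add(succ)
                yield from extend(start, path, on_path)
                on_path.remove(succ)
                path.pop()

    for start in sorted(graph):
        yield from extend(start, [start], {start})


graph = {
    # cycles taken from the text above
    "cdbs": ["libsqlite3-0"],
    "libsqlite3-0": ["src:sqlite3"],
    "src:sqlite3": ["cdbs"],
    "src:pkg-config": ["libglib2.0-dev"],
    "libglib2.0-dev": ["pkg-config"],
    "pkg-config": ["src:pkg-config"],
    # hypothetical two-vertex cycle, for illustration only
    "src:x": ["x"],
    "x": ["src:x"],
}

# shortest cycles first, as suggested for presenting them to the user
cycles = sorted(simple_cycles(graph), key=len)
for cycle in cycles:
    print(len(cycle), " -> ".join(cycle))
```

Sorting is done after the enumeration, so a faster enumerator could be dropped in without changing how the result is shown.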

17 June 2012

Johannes Schauer: port bootstrap build-ordering tool report 2

A copy of this post is sent to soc-coordination@lists.alioth.debian.org as well as to debian-bootstrap@lists.mister-muffin.de. Diary June 4 I added the first version of basenocycles.ml to git. Given an initial set of cross built packages, it tries to compile as much as possible on the resulting system in multiple rounds. June 5 During June 3, I discovered an error in my program that would only come up when using the Debian Sid package lists as the input:
Fatal error: exception Assert_failure("common/edosSolver.ml", 610, 11)
On this day, June 5, I wrote a minimal test case for this problem. The same day, Pietro figured out that this is a bug in dose which will be fixed in the next release. Begin writing code to figure out how important a binary package is for the further build process. Try to use Depsolver.edos_install to find out what packages are needed to make debhelper available. Restructure basenocycles.ml, exclude source packages that already have been built, still trouble with already existing binary packages and Cudf.mem_installed, comment stuff better. June 6 I wrote some crude code (only estimate, not correct, fixed later) that would give a rough overview of how often a given binary package is directly required as a build dependency. Debhelper came out as the most needed package. It is architecture:all, so it does not have to be built but it has unfulfilled runtime dependencies. To make those packages available, 13 (actually 11, fixed later) packages have to be compiled on Ubuntu Natty. But those packages all (except gettext) require debhelper itself to be built. The first dependency cycle. This dependency cycle (actually, the 12 cycles) can be broken by either cross compiling those source packages or by making them build without debhelper. One goal of the program is to help decide what the easier option is, but this is not yet implemented. To play around a bit, I created the possibility to specify a list of packages that are additionally to the minimal set of cross compiled packages also cross compiled. I added the 13 packages found above to the list, thus making the binary packages they build available. This made debhelper available in the system. As a result, 1625 out of 3339 source packages can be built with just a minimal build system (priority:essential packages plus build-essential) and debhelper available. The next package that blocks the most other source packages from being built is cdbs. 
The next nine packages in that list also require cdbs so it seems to be the next important package to be available. Pietro's suggestions make me:
 - do not open BootstrapCommon but ExtLib, Common, Algo, Debian
 - do proper option parsing and logging
 - use Debcudf.ignore_essential = true
 - do Debcudf.init_tables (binlist@srclist)
 - use @ with shorter list first
 - use more List.rev_append instead of @
 - use CudfAdd.who_provides to find if a package is available
June 7 Pietro and Patrick suggest that for solving the debhelper cycles, one could provide a minimal debhelper version so that the above list of 12 packages can be built without debhelper. I try to figure out how to get a list of packages that are missing to make a package installable/buildable. This functionality should be provided in dose but I fail to find it. June 8 Lacking a solution to the problem of June 7, I write a mail to Pietro. I start my first graphs in ocaml using the ocamlgraph library. The graph I generate starts off at a binary package. For each binary package it connects new vertices as its runtime dependencies. If a binary package is not arch:all and also not yet otherwise compiled, its source package is also added. The result is a graph in which the set of source packages will make the initial package available if those source packages are cross compiled. The graph is extended further than the source packages. June 9 I refine a couple of functions, make univ_get_pkg_by_name return the package with the highest version number. I wrote a rather lengthy (1027 words) email to the list that explains my status as of this day. I can create graphviz dot files with ocaml, can create node and edge types and create the graph by an imperative pattern that I saw a lot in Pietro's code. Disjunctions are not yet handled correctly (see mail from June 8). The graphs generated look like the following: http://mister-muffin.de/p/8nyc.png June 11 I write a test case which shows how CudfAdd.who_provides doesn't find virtual packages. Automate the process of finding the packages that, if cross compiled, would make another package available. Add more predicates (identifying source packages) and improve input file reading code. Move build_compile_rounds, which compiles as many source packages as it can in multiple rounds on a given minimal system, to a toplevel function and thereby abstract it more. 
Create a rudimentary text based menu to choose different actions to take for an analysis. Start writing an extended version of simple_dependency_graph for deeper analysis. Use xdot to show graphs from the text menu. Allow saving those graphs to a file. June 12 Move functionality from the extended version of simple_dependency_graph over to the normal version and delete the extended version. Add the new Closure vertex type. Create extended_dependency_graph which is supposed to not contain single binary package vertices but handle a package and its installation set as one vertex. The output of extended_dependency_graph is optionally reduced to the biggest (non degenerate) strongly connected component. User gets the option of choosing the exploration depth. June 13 Pietro replies to my mail from June 8 but apparently I failed to express myself well enough in my last mail, so I rephrase my question. Pietro replies to my email from June 11 and explains how the effect I see is due to "a nuisance of the debian to cudf encoding". As a result I change my code accordingly. June 14 Another lengthy (1130 words) email to the list. I explain what was done in the past days, what parts work and how they work. I list some rationales on why I did things the way I did them. The most important observation is that, after improving my code again and again, I ended up representing the dependency cycle problem in the same (very similar) way that Pietro suggested in the beginning. This is probably a good discovery. Much of the data in that email is of only little use as of June 16, because I have since made many improvements in correctness. As I don't have an answer to my other email to Pietro from June 13, I implement a very crude way to get an answer to the question of what packages are missing for a package to be available/compilable. I called it flatten_vpkgformula_best_effort and it suffers from many faults, including its handling of disjunctions and package conflicts. Patrick spots a problem. 
As a result, I make sure that at no point the source package of an arch:all package can be listed. June 15 As a reply to my mail from June 13, Pietro creates a new branch in the git and adds the code I needed to get a proper installation set. June 16 As a result of Pietro's efforts from June 15, I make great advancements on all fronts. Details of the current status follow in the next section. Results A big leap was made on June 16 due to Pietro's great help in making me understand how Depsolver.listcheck can be used for my purposes. My difficulties in finding the solution myself are rooted in many parts of the dose framework being poorly commented, but Pietro has already made a couple of documentation commits whenever things were unclear to me. Using Depsolver.listcheck makes it possible to be more distribution agnostic and I don't have to handle vpkgs, virtual packages and constraints myself anymore. The code also no longer suffers from wrongly analyzed dependencies and conflicts. The only thing that is not yet taken care of is that Depsolver.listcheck only chooses one out of several possible installation sets. A final version should be able to take into account that a different installation set could provide a better solution. Overall, in comparison to two weeks ago, I can now properly build, traverse and analyze graphs, can choose an installation set properly, understand more about dependencies, closures, dose and ocaml in general. Finding the importance of binary packages for building When calculating how many source packages depend on the availability of a binary package I originally flattened the pkg.Cudf.depends list twice for a rough overview. This is of course wrong due to disjunctions and conflicts and also doesn't provide a deep dependency inspection. The new method is to calculate, for every source package, an installation set that is necessary to compile it. 
The resulting list of binary packages is then used to find out how often a binary package appears in an installation set. I see three drawbacks though: Removing simple graph The simple graph which contained single binary and source packages was removed. I realized it doesn't really serve any purpose to look at it. As a result, Bin vertices and InstallDep edges are also not part of the graph anymore. Since it was responsible for generating the list of source packages that have to be cross built to make a package available, I created a new function get_cross_source_packages which uses an installation set to serve the same purpose. Fix extended_dependency_graph extended_dependency_graph now uses installation sets for generating the list of packages that is needed to compile a source package or install a binary package. The list of build dependencies does not include packages that are already installable. The list of runtime dependencies does not include packages that are otherwise available (cross built, arch:all...). Instead of checking for list membership all the time, I created hash tables for the list of installable as well as for the list of available binary packages. Future There are two big tasks for the next two weeks: Task one is to find a way to give hints on which packages to best consider for having reduced build dependencies. This would then probably finally make use of Pietro's cycle algorithms. Task two is to find a way to break cycles and create a build-DAG from a list of packages that already have staged build dependency information. Patrick is currently working on patching dpkg with Build-Depends-StageN dependencies as well as making perl cross compilable. If he doesn't need the ability to decide which packages to modify to have staged build dependencies in the near future, then task one is probably less urgent and therefore of less importance right now? 
On the other hand, I can easily generate fake reduced build dependencies so that doing task two right now would not be a problem. Also, having the solution for task two will make it possible to show the user what effect it would have to add reduced build dependencies to a package. For the reasons above (it's not urgent, task one profits from task two being solved) I will go and implement task two first (if there are no objections from my mentors). Another idea, that I discussed with wookey and Patrick yesterday, was that due to multiarch being used for more and more packages, there should exist a set of packages that is cross compilable without any change to the package. We agreed that I make a list of packages that, if cross compiled, would break dependency cycles and make other packages available. I created such a list of about 160 packages for Ubuntu Natty that, if cross compiled, made it possible to have 87% of Natty available (those numbers have to be treated with caution as I did not yet use the proper way of installation sets when generating that list, but the order of magnitude should be correct). Wookey can then try to cross compile those packages. If some packages of those "crucial" source packages are found to be cross compilable, then they should be cross compiled because it means that no work has to be done to break some cycles. Cross compiling all packages that are cross compilable out of the box is no solution, as only natively compiled packages can go into the archive. This is why the list of potentially additionally cross compiled source packages has to be kept as small as possible.
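The counting step described under "Finding the importance of binary packages for building" can be pictured with a few lines of Python 3. This is only a sketch: the real tool obtains installation sets from dose, whereas here they are hand-written, hypothetical sets, and rank_binary_packages is a name made up for this example.

```python
from collections import Counter


def rank_binary_packages(installation_sets):
    """Count in how many installation sets each binary package appears.

    installation_sets maps a source package name to the collection of
    binary packages needed to build it. Returns (package, count) pairs,
    most frequently needed package first.
    """
    counts = Counter()
    for binaries in installation_sets.values():
        # set() so that a package is counted once per source package
        counts.update(set(binaries))
    return counts.most_common()


# hypothetical installation sets, for illustration only
installation_sets = {
    "src:foo": {"debhelper", "gettext"},
    "src:bar": {"debhelper", "cdbs"},
    "src:baz": {"debhelper"},
}

ranking = rank_binary_packages(installation_sets)
print(ranking[0])
```

Because each source package contributes a whole installation set rather than a flattened dependency list, disjunctions are already resolved before the counting starts.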

03 June 2012

Johannes Schauer: port bootstrap build-ordering tool report 1

A copy of this post is sent to soc-coordination@lists.alioth.debian.org as well as to debian-bootstrap@lists.mister-muffin.de. Diary May 21 May 22 May 23 May 24 May 25 May 29 May 30 May 31 June 1 June 2 Results I learned a good chunk of ocaml and how to use dose3 and libcudf. I created a gitorious project and a git repository for all the sourcecode.
git clone git://gitorious.org/debian-bootstrap/bootstrap.git
The git as of now contains 30 commits and 1197 lines of ocaml code. So far, 62 emails have been exchanged between me and Pietro and Wookey. I created a mailinglist for this project where all email exchange so far is publicly accessible in the archives. You can also download all of the email exchange in mbox format. Everybody is welcome to join and/or read the list. What seems to be finished: the program that finds the minimal amount of source packages that have to be cross compiled to end up with a minimal build system. What it does is:
  1. get all essential packages
  2. get their runtime dependencies
  3. get build-essential plus runtime dependencies
  4. get all source packages that are necessary to build 1.-3. those are the packages that have to be cross compiled
  5. get a list of all packages that are built by source packages from 4.
  6. add all packages from 1.,2.,3. and 5. plus all arch:all packages to a universe
  7. use Depsolver.trim on that universe to figure out which of those packages are actually installable
The result of 7. will then contain a list of packages that are available automatically on the foreign system due to cross compiled source packages and arch:all packages. For Debian Sid, the output of my program is:
# (1) number of packages with priority:required: 62
# (2) plus, number of dependencies of priority:required packages: 20
# (3) plus, build-essential and dependencies: 31
# number of source packages to build the above: 71
# number of additional packages built from the above source packages: 292
# (4) number of packages of those plus arch:all packages that are installable: 6421
# total number of installable packages (1)+(2)+(3)+(4): 6534
For Ubuntu Natty it is:
# (1) number of packages with priority:required: 96
# (2) plus, number of dependencies of priority:required packages: 7
# (3) plus, build-essential and dependencies: 31
# number of source packages to build the above: 87
# number of additional packages built from the above source packages: 217
# (4) number of packages of those plus arch:all packages that are installable: 2102
# total number of installable packages (1)+(2)+(3)+(4): 2236
So for Debian, 71 source packages definitely have to be made cross compilable while for Natty, the number is 87. The last two days I was toying around with these minimal systems to see how big the number of source packages is, that can be built on top of them without running into dependency cycles. After installing the binary packages that were built, I checked again until no new packages could be built. For Natty, I was only able to find 28 additional packages that can be built on top of the 2236 existing ones. This means that a number of dependency cycles prevent building anything else. In the coming two weeks I will focus on coming up with a tool that cleverly helps the user to identify packages that would be useful to have for building more packages (probably determined by how many packages depend on it - debhelper is an obvious candidate). The tool would then show why that crucial package is not available (in case of debhelper because some of its runtime dependencies are not available and require debhelper to be built) and how the situation can best be resolved. The possible methods to do so are to identify a package that is part of a cycle and either cross compile it or let it have staged build dependencies.
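The "build, install, check again" loop described above is a fixed-point iteration. The following Python 3 sketch shows the idea under strong simplifications: build dependencies are flat lists of binary package names without disjunctions, versions or conflicts, whereas the real program asks dose for installability. The function name build_rounds and all package names in the example are hypothetical.

```python
def build_rounds(available, sources):
    """Build as many source packages as possible in multiple rounds.

    available: binary packages present on the minimal system
    sources:   {source: (build_dependencies, binaries_it_builds)}
    Returns the list of rounds and the source packages that stay
    unbuildable, for example because they sit in a dependency cycle.
    """
    available = set(available)
    remaining = dict(sources)
    rounds = []
    while True:
        # everything whose build dependencies are all available
        buildable = sorted(
            src for src, (deps, _) in remaining.items()
            if set(deps) <= available
        )
        if not buildable:
            break
        for src in buildable:
            # "install" the binaries built in this round
            available.update(remaining.pop(src)[1])
        rounds.append(buildable)
    return rounds, sorted(remaining)


# hypothetical toy archive: src:c and src:d form a cycle
sources = {
    "src:a": (["gcc"], ["a"]),
    "src:b": (["gcc", "a"], ["b"]),
    "src:c": (["d"], ["c"]),
    "src:d": (["c"], ["d"]),
}
rounds, stuck = build_rounds({"gcc"}, sources)
print(rounds)
print(stuck)
```

The packages left over at the end are the interesting ones: each of them is blocked, directly or indirectly, by a dependency cycle.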

Johannes Schauer: cross-compilable and bootstrappable Debian

When packaging software for Debian, there exist two important assumptions:
  1. Compilation is done natively
  2. Potentially all of Debian is available at compile time
Both assumptions make the life of a package maintainer much easier and they do not create any problem unless you are one of the unlucky few who want to run Debian on an architecture that it does not yet exist for. You will then have to use other distributions like OpenEmbedded or Gentoo which you compiled (or retrieved otherwise) for that new architecture to hack a core of Debian source packages until they build a minimal Debian system that you can chroot into and continue natively building the rest of it. But even if you manage to get that far you will continue to be plagued by cyclic build and runtime dependencies. So you start to hack source packages so that they drop some dependencies and you can break enough cycles to advance step by step. The Debian ports page lists 24 ports of Debian, so despite its unpleasant nature, porting it is something that is done fairly often. The process as laid out above has a number of drawbacks: If Debian would provide a set of core packages that are cross-compilable and which suffice for a minimal foreign build system, and if it would also have enough source packages that provide a reduced build dependency set so that all dependency cycles can be broken, building Debian for a yet unknown architecture could be mostly automated. The benefits would be: With three of this year's GSoC projects, this dream seems to come into reach. There is the "Multiarch Cross-Toolchains" project by Thibaut Girka and mentored by Hector Oron and Marcin Juszkiewicz. Cross-compiling toolchains need packages from the foreign architecture to be installed alongside the native libraries. Cross-compiler packages have been available through the emdebian repositories but always were more of a hack. With multiarch, it is now possible to install packages from multiple architectures at once, so that cross-compilation toolchains can be realized in a proper manner and therefore can also enter the main archives. 
Besides creating multiarch enabled toolchains, he will also be responsible for making them build on the Debian build system as cross-architecture dependencies are not yet supported. There is also the "Bootstrappable Debian" project by Patrick "P. J." McDermott and mentored by Wookey and Jonathan Austin. He will make a small set of source packages multiarch cross-compilable (using cross-compilers provided by Thibaut Girka) and add a Build-Depends-StageN header to critical packages so that they can be built with reduced build dependencies for breaking dependency cycles. He will also patch tools as necessary to recognize the new control header. And then there is my project: "Port bootstrap build-ordering tool" (Application). It is mentored by Wookey and Pietro Abate. In contrast to the other two, my output will be more on the meta-level as I will not modify any actual Debian package or patch Debian tools with more functionality. Instead the goal of this project is threefold:
  1. find the minimal set of source packages that have to be cross compiled
  2. help the user to find packages that are good candidates for breaking build dependency cycles through added staged build dependencies or by making them cross-compilable
  3. develop a tool that takes the information about packages that can be cross compiled or have staged build dependencies to output an ordering with which packages must be built to go from nothing to a full archive
More on that project in my follow-up post.

25 May 2012

Johannes Schauer: setting up mailman, postfix, lighttpd

I was worried about having to learn hundreds of configuration options to properly set up mailman, postfix and lighttpd on Debian Squeeze. It turned out that, except for lighttpd, it all works out of the box.
apt-get install postfix
When asked by debconf, I specified lists.mister-muffin.de as the fully qualified domain name.
apt-get install mailman
newlist mailman
The newlist command reminds me that I have to add its output to /etc/aliases. After doing so, I have to run:
newaliases
From now on, I can add any mailinglist by running newlist, editing /etc/aliases and running newaliases. Mailinglists can also be added through the mailman webinterface but one still has to put the according entries into /etc/aliases. Following is a working lighttpd configuration that works out of the box with the default settings of mailman on Debian squeeze. This was the only part that caused me some headaches.
server.modules += ("mod_alias", "mod_cgi", "mod_accesslog")
$HTTP["host"] == "lists.mister-muffin.de" {
    accesslog.filename = "/var/log/lighttpd/lists-access-log"
    alias.url += (
        "/cgi-bin/mailman/private/" => "/var/lib/mailman/archives/private/",
        "/cgi-bin/mailman/public/" => "/var/lib/mailman/archives/public/",
        "/pipermail/" => "/var/lib/mailman/archives/public/",
        "/cgi-bin/mailman/"=> "/var/lib/mailman/cgi-bin/",
        "/images/mailman/" => "/usr/share/images/mailman/",
    )
    cgi.assign = (
        "/admin" => "",
        "/admindb" => "",
        "/confirm" => "",
        "/create" => "",
        "/edithtml" => "",
        "/listinfo" => "",
        "/options" => "",
        "/private" => "",
        "/rmlist" => "",
        "/roster" => "",
        "/subscribe" => "")
}
server.document-root        = "/var/www"
server.errorlog             = "/var/log/lighttpd/error.log"
server.pid-file             = "/var/run/lighttpd.pid"
server.username             = "www-data"
server.groupname            = "www-data"
index-file.names            = ( "index.html" )
server.dir-listing          = "disable"
include_shell "/usr/share/lighttpd/create-mime.assign.pl"
As a bonus, I wanted to import my existing email exchange with my GSoC mentors into the mailinglist. First I was planning on manually sending the email messages to the list, but a much easier option is to just import them in mbox format. To extract all email messages, I first wrote the following python snippet:
import mailbox, itertools
box = mailbox.mbox('~/out')
for message in itertools.chain(mailbox.mbox('~/sent'), mailbox.Maildir('~/Mail/Web/', factory=None)):
    if (("wookey" in message.get('to', "").lower()
            or "wookey" in message.get('cc', "").lower()
            or "wookey" in message.get('from', "").lower()
            or "abate" in message.get('to', "").lower()
            or "abate" in message.get('cc', "").lower()
            or "abate" in message.get('from', "").lower())
            and not message['subject'][0] == '['
            and not message['subject'] == "multistrap"):
        box.add(message)
box.close()
It iterates through messages in my mbox and maildir mailboxes, filters them for emails by wookey or pietro, strips away some messages I found to not be relevant and then saves the filtered result into the mbox mailbox ~/out. It is important to specify factory=None for the Maildir parser, because it otherwise defaults to rfc822.Message instead of MaildirMessage. Also do not forget to call box.close(). I initially forgot to do so and ended up with missing messages in ~/out. I then copy the archive in its place:
scp out lists.mister-muffin.de:/var/lib/mailman/archives/private/debian-bootstrap.mbox/debian-bootstrap.mbox
Another thing that initially caused me trouble was that the mbox didn't have the correct permissions after the scp. Fixing them:
chown -R list:www-data /var/lib/mailman/archives/private/
chmod 664 /var/lib/mailman/archives/private/debian-bootstrap.mbox/debian-bootstrap.mbox
And update the mailman archive like this:
sudo -u list /usr/lib/mailman/bin/arch debian-bootstrap /var/lib/mailman/archives/private/debian-bootstrap.mbox/debian-bootstrap.mbox
Initially I was running the above command as root which screws up permissions as well.

21 May 2012

Johannes Schauer: sisyphus wins ICRA 2012 VMAC

Sisyphus is a piece of software that I wrote as a member of a team from Jacobs University led by Prof. Dr. Andreas Nüchter. It managed to place our team first in this year's IEEE ICRA 2012 Virtual Manufacturing Automation Competition in all three rounds. The goal was to stack a given set of boxes of different length, height and width on a pallet in a way that achieved optimal volume utilization, center of mass and interlock of the boxes. Besides the cartesian placement of a box on the pallet, the only other degree of freedom was a 90° rotation of the box around a vertical axis. Since the arrangement of boxes into a three dimensional container is NP hard (three dimensional orthogonal knapsack), I decided on a heuristic for an approximate solution. The premise is that there are many boxes of equal height, which was the case in the test cases that were available from the 2011 VMAC. Given this premise, my heuristic was to arrange the boxes into layers of equal height and then stack these layers on top of each other. A set of boxes that would be left over, or that were too few from the start to form their own full layer, would then be stacked on top of the completed layers. There is a video of what this looked like. My code is now online on github and it is even documented for everybody who is not me (or a potential future me of course). This blog post is about the "interesting" parts of sisyphus. You can read about the overall workings of it in the project's README. Python dict to XML and XML to Python dict The evaluation program for the challenge reads XML files, and the pallet size and the list of articles with their sizes are also given in XML format. So I had to have a way to easily read article information from XML and to easily dump my data into XML format. Luckily, none of the XML involved made use of XML attributes at all, so the only children a node had were other nodes. 
Thus, the whole XML file could be represented as a Python dictionary with keys being tagnames and the values being other dictionaries or lists or strings or integers. The code doing that uses xml.etree.ElementTree and turns out to be very simple:
from xml.etree import ElementTree

def xmltodict(element):
    def xmltodict_handler(parent_element):
        result = dict()
        for element in parent_element:
            if len(element):
                obj = xmltodict_handler(element)
            else:
                obj = element.text
            if result.get(element.tag):
                if hasattr(result[element.tag], "append"):
                    result[element.tag].append(obj)
                else:
                    result[element.tag] = [result[element.tag], obj]
            else:
                result[element.tag] = obj
        return result
    return {element.tag: xmltodict_handler(element)}

def dicttoxml(element):
    def dicttoxml_handler(result, key, value):
        if isinstance(value, list):
            for e in value:
                dicttoxml_handler(result, key, e)
        elif isinstance(value, basestring):
            elem = ElementTree.Element(key)
            elem.text = value
            result.append(elem)
        elif isinstance(value, int) or isinstance(value, float):
            elem = ElementTree.Element(key)
            elem.text = str(value)
            result.append(elem)
        elif value is None:
            result.append(ElementTree.Element(key))
        else:
            res = ElementTree.Element(key)
            for k, v in value.items():
                dicttoxml_handler(res, k, v)
            result.append(res)
    result = ElementTree.Element(element.keys()[0])
    for key, value in element[element.keys()[0]].items():
        dicttoxml_handler(result, key, value)
    return result

def xmlfiletodict(filename):
    return xmltodict(ElementTree.parse(filename).getroot())

def dicttoxmlfile(element, filename):
    ElementTree.ElementTree(dicttoxml(element)).write(filename)

def xmlstringtodict(xmlstring):
    return xmltodict(ElementTree.fromstring(xmlstring))

def dicttoxmlstring(element):
    return ElementTree.tostring(dicttoxml(element))
Let's try this out:
>>> from util import xmlstringtodict, dicttoxmlstring
>>> xmlstring = "<foo><bar>foobar</bar><baz><a>1</a><a>2</a></baz></foo>"
>>> xmldict = xmlstringtodict(xmlstring)
>>> print xmldict
{'foo': {'baz': {'a': ['1', '2']}, 'bar': 'foobar'}}
>>> dicttoxmlstring(xmldict)
'<foo><baz><a>1</a><a>2</a></baz><bar>foobar</bar></foo>'
The dict container doesn't preserve order, but as XML doesn't require that, this is also not an issue. Arranging items in layers Once I had decided to take the layered approach, the 3D knapsack problem boiled down to a 2D knapsack problem. The problem statement now was: how to best fit small rectangles into a big rectangle? I decided on a simple and fast approach as it is explained in Jake Gordon's blog article. There is a demo of his code and should the site vanish from the net, the code is on github. This solution seemed to generate results that were "good enough" while being simple to implement and fast to execute. If you look very hard, you can still see some design similarities between my arrange_spread.py and his packer.js code. Jake Gordon got his idea from Jim Scott who wrote an article about arranging randomly sized lightmaps into a bigger texture. There is also an ActiveState Code recipe from 2005 which looks very similar to the code by Jake Gordon. The posts of Jake Gordon and Jim Scott explain the solution well, so I don't have to repeat it. Should the above resources go offline, I made backups of them here and here. There is also a backup of the ActiveState piece here. Spreading items out The algorithm above would cram all rectangles into a top-left position. As a result, there would mostly be space at the bottom and left edge of the available pallet surface. This is bad for two reasons:
  1. the mass is distributed unequally
  2. articles on the layer above at the bottom or left edge, are prone to overhang too much so that they tumble down
Instead all articles should be spread over the available pallet area, creating small gaps between them instead of big spaces at the pallet borders. Since articles were of different size, it was not clear to me from the start what "equal distribution" would even mean because it was obvious that it was not as simple as making the space between all rectangles equal. The spacing had to be different between them to accommodate differently sized boxes. The solution I came up with made use of the tree structure that was built by the algorithm that arranged the rectangles in the first place. The idea is to spread articles vertically first, recursively starting with the deepest nodes and spreading them out in their parent rectangle, and then to spread them horizontally, spreading the upper nodes first and recursively resizing and spreading child nodes. The whole recursion idea created problems of its own. One of the nicest recursive beauties is the following function:
def get_max_horiz_nodes(node):
    if node is None or not node['article']:
        return [], []
    elif node['down'] and node['down']['article']:
        rightbranch, sr = get_max_horiz_nodes(node['right'])
        rightbranch = [node] + rightbranch
        downbranch, sd = get_max_horiz_nodes(node['down'])
        ar = rightbranch[len(rightbranch)-1]['article']
        ad = downbranch[len(downbranch)-1]['article']
        if ar['x']+ar['width'] > ad['x']+ad['width']:
            return rightbranch, sr+[downbranch[0]]
        else:
            return downbranch, sd+[rightbranch[0]]
    else:
        rightbranch, short = get_max_horiz_nodes(node['right'])
        return [node] + rightbranch, short
get_max_horiz_nodes() traverses all branches of the tree that node has below itself and returns a tuple containing the list of nodes that form the branch that stretches out widest, plus the list of nodes that are in the other branches (which are shorter than the widest).

Another interesting problem was how to decide on the gap between articles. This was interesting because the number resulting from subtracting the sum of the article lengths (or widths) from the available length (or width) was mostly not divisible by the number of gaps without leaving a remainder. So there had to be an algorithm that gives me a list of integers, none of them differing by more than one from any other, that, when summed up, would give me the total amount of empty space. Or in other words: how to divide a number m into n integer pieces such that none of those integers differs by more than 1 from any other. Surprisingly, generating this list doesn't require any complex loop constructs:
>>> m = 108 # total amount
>>> n = 7 # number of pieces
>>> d,e = divmod(m, n)
>>> pieces = (e)*[(d+1)]+(n-e)*[d]
>>> print pieces
[16, 16, 16, 15, 15, 15, 15]
>>> sum(pieces) == m
True
>>> len(pieces) == n
True
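As a rough sketch of how such a list of gaps could then be turned into coordinates, the following Python 3 snippet places articles along one axis of the pallet. The names split_evenly and spread_positions are made up for this illustration, and the choice to put gaps only between neighbouring articles (so that the first and last article touch the pallet border) is an assumption, not necessarily what arrange_spread.py does.

```python
def split_evenly(total, parts):
    """Split `total` into `parts` integers that differ by at most one."""
    quotient, remainder = divmod(total, parts)
    return remainder * [quotient + 1] + (parts - remainder) * [quotient]


def spread_positions(available, widths):
    """Return the start coordinate of each article along one axis.

    Assumption: the free space is distributed over the gaps *between*
    neighbouring articles, so the outer articles touch the borders.
    """
    free = available - sum(widths)
    if free < 0:
        raise ValueError("articles do not fit into the available length")
    if len(widths) < 2:
        return [0] * len(widths)
    gaps = split_evenly(free, len(widths) - 1)
    positions = [0]
    # zip() stops after the last gap, so every article gets one position
    for width, gap in zip(widths, gaps):
        positions.append(positions[-1] + width + gap)
    return positions


print(split_evenly(108, 7))
print(spread_positions(1200, [300, 400, 250]))
```

With an odd amount of free space, the leftover unit simply ends up in the first gap, which is the same behaviour as the divmod() one-liner.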
You can test out the algorithms that arrange rectangles and spread them out by cloning the git repository and then running:
PYTHONPATH=. python legacy/arrange_spread.py
The results will be the SVG files test1.svg and test2.svg, the latter showing the spread-out result. Here is an example of what the output looks like (without the red border, which is drawn to mark the pallet border):

arrange_spread2.py contains an adaptation of arrange_spread.py for the actual problem.

Permutations without known length

When creating a layer out of articles of the same height, there are four strategies that I can choose from. There are four because there are two methods, each of which I can either use or not: I can rotate the article by 90° by default or not, and I can rotate the pallet or not. So every time I build a new layer, there are those four options. Depending on which strategy I choose, a different number of leftover articles will not have fit into any layer, because each strategy is more or less efficient. To try out all combinations of possible layer arrangements, I have to walk through a tree where at each node I branch four times, once for each strategy. Individual neighboring nodes might be the same, but this outcome is unlikely because the paths leading to those neighboring nodes differ. To simplify, let's name the four possible strategies for each layer 0, 1, 2 and 3. I now want an algorithm that enumerates all possible permutations of those four numbers for "some" length. This is similar to counting, and the itertools module comes with the product() function that nearly does what I want. For example, should I know that my tree does not become deeper than 8 (read: no more than eight layers will be built), then I can just run:
>>> for i in itertools.product([0,1,2,3], repeat=8):
...     print i
...
(0,0,0,0,0,0,0,0)
(0,0,0,0,0,0,0,1)
(0,0,0,0,0,0,0,2)
(0,0,0,0,0,0,0,3)
(0,0,0,0,0,0,1,0)
(0,0,0,0,0,0,1,1)
(0,0,0,0,0,0,1,2)
(0,0,0,0,0,0,1,3)
This would work if the number of layers created with each strategy was the same. But as each strategy behaves differently depending on the input, it cannot be known how many layers a sequence of strategies will yield before actually trying it out. The sequence (0,0,0,0,0,0,0,0) might create only 7 layers, in which case (0,0,0,0,0,0,0,1), (0,0,0,0,0,0,0,2) and (0,0,0,0,0,0,0,3) yield the same output, as only the first 7 strategies count. This would create duplicates which I should not waste CPU cycles on later. It might also be that (0,0,0,0,0,0,1,0) turns out to be a combination of strategies that creates more than 8 layers, in which case the whole thing fails. So what I need is a generator that gives me a new strategy depending on how often it is asked for one. It should dynamically extend the tree of possible permutations to accommodate any size. Since the tree will become humongous (4^11 = 4194304), already traversed nodes should automatically be cleaned up so that only the nodes that make up the current list of strategies stay in memory at any point in time. This all sounded complicated, which made me even more surprised by how simple the solution was in the end. Here is a version of the algorithm that could easily be ported to C:
class Tree:
    def __init__(self, branch_factor):
        self.branch_factor = branch_factor
        self.root = {"value": None, "parent": None, "children": []}
        self.current = self.root

    def next(self):
        if not self.current["children"]:
            self.current["children"] = [{"value": val, "parent": self.current, "children": []}
                                        for val in range(self.branch_factor)]
        self.current = self.current["children"][0]
        return self.current["value"]

    def reset(self):
        if self.current["parent"]:
            self.current["parent"]["children"].pop(0)
        else:
            return False
        if self.current["parent"]["children"]:
            self.current = self.root
            return True
        else:
            self.current = self.current["parent"]
            return self.reset()

    def __str__(self):
        return str(self.root)
It would be used like this:
>>> tree = Tree(4)
>>> print tree.next(), tree.next(), tree.next()
>>> while tree.reset():
...     print tree.next(), tree.next(), tree.next()
This is equivalent to iterating over itertools.product([0,1,2,3], repeat=3). The special part is that in each iteration of the loop I can call tree.next() an arbitrary number of times, just as often as it is needed. Whenever I cannot generate an additional layer anymore, I can call tree.reset() to start a new permutation. For my code I used a Python-specific version which is a generator:
def product_varlength(branch_factor):
    root = {"value": None, "parent": None, "children": []}
    current = root
    while True:
        if not current["children"]:
            current["children"] = [{"value": val, "parent": current, "children": []}
                                   for val in range(branch_factor)]
        current = current["children"][0]
        if (yield current["value"]):
            while True:
                if current["parent"]:
                    current["parent"]["children"].pop(0)
                else:
                    return
                if current["parent"]["children"]:
                    current = root
                    break
                else:
                    current = current["parent"]
It is used like this:
it = product_varlength(4)
print it.next(), it.send(False), it.send(False)
while True:
    print it.send(True), it.send(False), it.send(False)
Again, the expression in the loop can have any number of it.send(False). The first it.send(True) tells the generator to do a reset.
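
To make the effect on duplicates concrete, here is a small self-contained sketch in Python 3 (the code above is Python 2). It repeats the generator and wraps it in a driver, strategy_sequences(), which is a name invented for this example. The callback is_complete() and the toy rule of how many articles each strategy places per layer are assumptions for illustration only; in the real code that decision comes from actually building the layers.

```python
def product_varlength(branch_factor):
    """Python 3 port of the generator; send(True) starts a new sequence."""
    root = {"value": None, "parent": None, "children": []}
    current = root
    while True:
        if not current["children"]:
            current["children"] = [
                {"value": val, "parent": current, "children": []}
                for val in range(branch_factor)
            ]
        current = current["children"][0]
        if (yield current["value"]):
            # Reset: drop the finished node and every exhausted ancestor.
            while True:
                if current["parent"]:
                    current["parent"]["children"].pop(0)
                else:
                    return  # the root is exhausted, nothing left to enumerate
                if current["parent"]["children"]:
                    current = root
                    break
                current = current["parent"]


def strategy_sequences(branch_factor, is_complete):
    """Yield each strategy sequence once, extended until is_complete(seq).

    `is_complete` is a hypothetical callback reporting whether the
    strategies chosen so far have used up all articles.
    """
    it = product_varlength(branch_factor)
    value = next(it)
    while True:
        seq = [value]
        while not is_complete(seq):
            seq.append(it.send(False))
        yield tuple(seq)
        try:
            # Reset; the return value is the first strategy of the next sequence.
            value = it.send(True)
        except StopIteration:
            return


# Toy model (an assumption): strategy 0 places one article per layer,
# strategy 1 places two, and three articles have to be placed in total.
ARTICLES_PER_LAYER = {0: 1, 1: 2}


def all_placed(seq):
    return sum(ARTICLES_PER_LAYER[s] for s in seq) >= 3


for seq in strategy_sequences(2, all_placed):
    print(seq)
```

In this toy model the sequences come out with different lengths (two or three layers), and none of them is a prefix of another, so no combination is tried twice.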

19 May 2012

Johannes Schauer: network file transfer to marvell kirkwood

I have a Seagate GoFlex Net with two 2TB harddrives attached to it via SATA. The device itself is connected to my PC via its Gigabit Ethernet connection. It houses a Marvell Kirkwood at 1.2GHz and 128MB. I am booting Debian from a USB stick connected to its USB 2.0 port. The specs are pretty neat so I planned it as my NAS with 4TB of storage being attached to it. The most common use case is the transfer of big files (1-10 GB) between my laptop and the device. Now what are the common ways to achieve this? scp:
scp /local/path user@goflex:/remote/path
rsync:
rsync -Ph /local/path user@goflex:/remote/path
sshfs:
sshfs user@goflex:/remote/path /mnt
cp /local/path /mnt
ssh:
ssh user@goflex "cat > /remote/path" < /local/path
I then did some benchmarks to see how they perform:
  scp: 5.90 MB/s
  rsync: 5.16 MB/s
  sshfs: 5.05 MB/s
  ssh: 5.42 MB/s
Since they all use ssh for transmission, the similarity of the results does not come as a surprise, and 5.90 MB/s is also not too shabby for a plain scp. It means that I can transfer 1 GB in a bit under three minutes. I could live with that. Even for 10 GB files I would only have to wait for half an hour, which is mostly okay since it is usually known well in advance that a file is needed. But let's see if we can somehow get faster than this. Let's analyze where the bottleneck is. Let's have a look at the effective TCP transfer rate with netcat:
ssh user@goflex "netcat -l -p 8000 > /dev/null"
dd if=/dev/zero bs=10M count=1000 | netcat goflex 8000
79.3 MB/s, wow! Can we get more? Let's try increasing the buffer size on both ends. This can be done using nc6 with the -x argument on both sides.
ssh user@goflex "netcat -x -l -p 8000 > /dev/null"
dd if=/dev/zero bs=10M count=1000 | netcat -x goflex 8000
103 MB/s. Okay, this is definitely NOT the bottleneck here. Let's see how fast I can read from my harddrive:
hdparm -tT /dev/sda
114.86 MB/s.. hmm... and writing to it?
ssh user@goflex "time sh -c 'dd if=/dev/zero of=/remote/path bs=10M count=100; sync'"
42.93 MB/s. Those values are far faster than the puny 5.90 MB/s I get with scp. A look at the CPU usage during the transfer shows that the ssh process is at 100% CPU usage the whole time. It seems the bottleneck is ssh and the encryption/decryption involved. I'm transferring directly from my laptop to the device. Not even a switch is in the middle, so encryption seems to be quite pointless here. Even authentication doesn't seem to be necessary in this setup. So how to make the transfer unencrypted? The ssh protocol specifies a null cipher for unencrypted connections. OpenSSH doesn't support this. Supposedly, adding
  { "none", SSH_CIPHER_NONE, 8, 0, 0, EVP_enc_null }
to cipher.c adds a null cipher, but I didn't want to patch around in my installation. So let's see how a plain netcat performs.
ssh user@goflex "netcat -l -p 8000 > /remote/path"
netcat goflex 8000 < /local/path
32.9 MB/s. This is far better! Let's try a bigger buffer:
ssh user@goflex "netcat -x -l -p 8000 > /remote/path"
netcat -x goflex 8000 < /local/path
37.8 MB/s, now this is even better! My gigabyte will now take under half a minute and my 10 GB file under five minutes. But it is tedious to copy multiple files or even a whole directory structure with netcat. There are far better tools for this. An obvious candidate that doesn't encrypt is rsync when used with the rsync protocol.
rsync -Ph /local/path user@goflex::module/remote/path
30.96 MB/s, which is already much better than ssh! I used the following line to have the rsync daemon started by inetd:
rsync stream tcp nowait root /usr/bin/rsync rsyncd --daemon
But it is slower than pure netcat. If we want directory trees, then how about netcatting a tarball?
ssh user@goflex "netcat -x -l -p 8000 | tar -C /remote/path -x"
tar -c /local/path | netcat goflex 8000
26.2 MB/s, so tar seems to add quite some overhead. How about ftp then? For this test I installed vsftpd and achieved a speed of 30.13 MB/s. This compares well with rsync. I also tried out nfs. Not surprisingly, its transfer rate is on par with rsync and ftp at 31.5 MB/s. So what did I learn? Let's make a table:
method          speed in MB/s
scp             5.90
rsync+ssh       5.16
sshfs           5.05
ssh             5.42
netcat          32.9
netcat -x       37.8
netcat -x tar   26.2
rsync           30.96
ftp             30.13
nfs             31.5
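
The transfer times quoted throughout this post follow directly from these rates. As a quick way to check the arithmetic, here is a small Python 3 sketch that derives the time for 1 GB and 10 GB as well as the speed-up relative to scp. It assumes 1 GB = 1000 MB; the helper names transfer_seconds and format_duration are made up for this example.

```python
# Measured throughput in MB/s, copied from the table above.
speeds = {
    "scp": 5.90,
    "rsync+ssh": 5.16,
    "sshfs": 5.05,
    "ssh": 5.42,
    "netcat": 32.9,
    "netcat -x": 37.8,
    "netcat -x tar": 26.2,
    "rsync": 30.96,
    "ftp": 30.13,
    "nfs": 31.5,
}


def transfer_seconds(size_mb, speed_mb_s):
    """Time in seconds needed to move size_mb megabytes at speed_mb_s."""
    return size_mb / speed_mb_s


def format_duration(seconds):
    """Render a duration as e.g. '4m25s', rounded to whole seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m{secs:02d}s"


for method, speed in speeds.items():
    one_gb = format_duration(transfer_seconds(1000, speed))    # assumes 1 GB = 1000 MB
    ten_gb = format_duration(transfer_seconds(10000, speed))
    speedup = speed / speeds["scp"]
    print(f"{method:<14} {speed:6.2f} MB/s  1 GB: {one_gb:>7}  "
          f"10 GB: {ten_gb:>7}  {speedup:4.1f}x scp")
```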
For the transfer of a directory structure or many small files, unencrypted rsync seems the way to go. It outperforms a copy over ssh more than five-fold. When the convenience of having the remote data mounted locally is needed, nfs outperforms sshfs at speeds similar to rsync and ftp. As rsync and nfs already provide good performance, I didn't look into a more convenient solution using ftp. My policy will now be to use rsync for partial file transfers and to mount my remote files with nfs. For the transfer of one huge file, netcat is faster. With increased buffer sizes it is about 15% faster than without (37.8 versus 32.9 MB/s). But copying a file with netcat is tedious, hence I wrote a script that reduces the whole remote-login, listen, send process to one command. The first argument is the local file, the second argument is the remote name and path, just as in scp.
#!/bin/sh -e
HOST=${2%%:*}
USER=${HOST%%@*}
if [ "$HOST" = "$2" -o "$USER" = "$HOST" ]; then
        echo "second argument is not of form user@host:path" >&2
        exit 1
fi
HOST=${HOST#*@}
LPATH=$1
LNAME=$(basename "$1")
RPATH=$(printf %q "${2#*:}/$LNAME")
ssh "$USER@$HOST" "nc6 -x -l -p 8000 > $RPATH" &
sleep 1.5
pv "$LPATH" | nc6 -x "$HOST" 8000
wait $!
ssh "$USER@$HOST" "md5sum $RPATH" &
md5sum "$LPATH"
wait $!
I use pv to get a status of the transfer on my local machine and ssh to log in to the remote machine and start netcat in listening mode. After the transfer I check the md5sum to be sure that everything went fine. This step can of course be left out, but during testing it was useful. Escaping of the arguments is done with printf %q. One problem with the above is the sleep, which cannot be avoided: it must be there to give the remote some time to start netcat and listen. This is unclean. Another problem is that one has to specify a username. Yet another is that in scp one has to double-escape the argument, while above this is not necessary. The host that it netcats to is the same as the host it sshes to. This is not necessarily correct, as one can specify an alias in ~/.ssh/config. Last but not least, this only transfers from the local machine to the remote host. Doing it the other way round is of course possible in the same manner, but then one must be able to tell how the local machine is reachable from the remote host. Due to all those inconveniences I decided not to expand on the above script. Plus, rsync and nfs seem to perform well enough for day-to-day use.
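
For readers less familiar with shell parameter expansion, the argument handling at the top of the script can be sketched in Python 3 as follows. The function names parse_target and remote_path are invented for this illustration, and shlex.quote merely plays a role similar to printf %q here; it quotes with single quotes instead of backslashes.

```python
import os
import shlex


def parse_target(target):
    """Split 'user@host:path' into (user, host, path).

    Mirrors the expansions in the script: ${2%%:*} keeps everything before
    the first colon, ${HOST%%@*} everything before the first '@',
    ${HOST#*@} everything after it and ${2#*:} everything after the colon.
    """
    login, colon, path = target.partition(":")
    user, at, host = login.partition("@")
    if not colon or not at:
        raise ValueError("second argument is not of form user@host:path")
    return user, host, path


def remote_path(local_path, target):
    """Return the shell-quoted destination path on the remote side."""
    _user, _host, path = parse_target(target)
    # Assumes local_path names a file without a trailing slash.
    name = os.path.basename(local_path)
    return shlex.quote(path + "/" + name)


print(parse_target("user@goflex:/remote/path"))
print(remote_path("/local/my file.iso", "user@goflex:/remote/path"))
```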

18 May 2012

Johannes Schauer: a periodic counter

tldr: counting without cumulative timing errors

Sometimes I want just a small counter, incrementing an integer each second, running somewhere in a terminal. Maybe it is because my wristwatch is in the bathroom or because I want to do more rewarding things than counting seconds manually. Maybe I want to know not only how long something takes but also how long it has already been running in the middle of its execution. There are many reasons why I would want a script that does nothing but count upward or downward at some specific frequency. Some bonuses: Now this cannot be hard, right? One would probably write this line and be done with it:
while sleep 1; do echo $i; i=$((i+1)); done
or to count for a certain number of steps:
for i in $(seq 1 100); do echo $i; sleep 1; done
This would roughly do the job, but each iteration adds some small offset, and though small, this offset quickly accumulates. Sure, that cumulative error is tiny, but given that this task seems to be so damn trivial, I couldn't bear running any of the above anymore and started looking for a solution. Sure, I could just quickly hack a small C program that would check gettimeofday(2) at each iteration and adjust the time passed to usleep(3) accordingly, but there HAD to be people before me with the same problem who had already come up with a solution. And there were! The solution is the sleepenh(1) program which, when given the timestamp of its last invocation and the sleep time in floating point seconds, will sleep for just the right amount to keep the overall frequency stable. The author suggests that sleepenh be used in shell scripts that need to repeat an action at a regular time interval, and that is just what I did. The result is trivial and simple but does just what I want: You can check it out and read how to use it and what to do with it on github: https://github.com/josch/periodic

Now let's compare the periodic script with the second example from above:
$ time sh -c 'for i in $(seq 1 1000); do echo $i; sleep 1; done'
0.08s user 0.12s system 0% cpu 16:41.55 total
So after only 1000 iterations, the counter is already off by 1.55 seconds. This means that instead of one tick every 1.0 s, the actual period was 1.00155 s, which is a frequency of about 0.99845 Hz instead of 1.0 Hz. Is it too much to not want this 0.155% of error?
$ time ./periodic -c 1000
0.32s user 0.00s system 0% cpu 16:40.00 total
1000 iterations took exactly 1000 seconds. Cool.
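
The underlying trick, as I understand the description of sleepenh above, is to sleep until an absolute deadline instead of sleeping for a fixed duration, so that the time spent doing work between two ticks is not added on top. The following Python 3 sketch illustrates that idea; it is not the periodic script itself, and the function name periodic as well as the injectable clock and sleep parameters are assumptions made to keep the example testable.

```python
import time


def periodic(period, count, clock=time.monotonic, sleep=time.sleep):
    """Yield 0..count-1, one value every `period` seconds, without drift."""
    start = clock()
    for i in range(count):
        yield i
        # The deadline is computed from the start time, not from "now",
        # so the duration of the loop body does not accumulate as error.
        deadline = start + (i + 1) * period
        remaining = deadline - clock()
        if remaining > 0:
            sleep(remaining)


# Short demonstration: three ticks, 0.1 seconds apart.
for tick in periodic(0.1, 3):
    print(tick)
```

If the loop body ever takes longer than one period, this sketch skips the sleep and catches up on the following ticks rather than shifting all later deadlines.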

27 April 2012

Ana Beatriz Guerrero Lopez: Debian in the Google Summer of Code 2012

This year our efforts have paid off and despite there being more mentoring organizations than there were in 2011 (175 in 2011 and 180 in 2012), this year in Debian we got 81 submissions versus 43 submissions in 2011.
You can see here the graphs of applications against time from this year: 2012
The result is that this year we'll have 15 students in Debian versus 9 students last year! Without further ado, here is the list of projects and students who will be working with us this summer: If you want to know more about these projects, follow the links and ask the students (and mentors)!
