Bootstrappable Debian FOSDEM talk

categories: debian

I will give a talk about the current status of automating Debian bootstrapping at FOSDEM 2013. There will also be a live demo of the developed toolchain. The talk will start 16:30 on Saturday, 2 February 2013 in room H.1302 as part of the Cross distro devroom track and will be half an hour long. Directly after my talk, at 17:00, Wookey will give his about bootstrapping Debian/Ubuntu for arm64 in the same location.

The current version of the slides can be downloaded here. The latex beamer source is available in the git repository of the debian bootstrap project.

If you can't make it to FOSDEM, then my last post about the topic gives away most of the things I will be talking about.

View Comments

Bootstrappable Debian - New Milestone

categories: debian

This post is about the port bootstrap build ordering tool (naming suggestions welcome) which was started as a Debian Google Summer of Code project in 2012 and continued to be developed afterwards. Sources are available through gitorious.

In the end of November 2012, I managed to put down an approximation algorithm to the feedback arc set problem which allowed to break the dependency graph into a directed acyclic graph with only few removed build dependencies. I wrote about this effort on our mailinglist but didnt mention it here because it was still too much of a proof-of-concept. Later, in January 2013, I mentioned the result of this algorithm in an email wookey and me wrote to debian-devel mailinglist.

Many things happened since November 2012:

Processing pipeline instead of monolithic tools

The tools I developed so far tried to accomplish everything by themselves, reusing functionality implemented in a central library. Therefor, if one wanted to try out even trivial new things, it mostly meant to hack some OCaml code. Pietro Abate suggested to instead develop smaller tools which could work independently of each other, would only execute one algorithm each and could easily be connected together in different ways to achieve different effects.

This switch is now done and all functionality from the old tools is moved into a new toolset. The exchange format between the tools is either plain text files in deb822 control format (Packages/Sources files) or a dependency graph. The dependency graph is currently marshalled by OCaml but future versions should work with just passing a GraphML (an XML graph format) representation around.

This new way of doing things seems close to the UNIX philosophy (each program does one thing well, data is stored as text, every program is a filter). For example the deb822 control output can easily be manipulated using grep-dctrl(1) and there exist many tools which can read, analyze and manipulate GraphML. It is a big improvement over the old, monolithic tools which did not allow to manipulate any intermediate result by external, existing tools. Currently, a shell script (native.sh) will execute all tools in a meaningful order, connecting them together correctly. The same tools will be used for a future cross.sh but they will be connected differently.

I wrote about a first proposal of what the individual tools should do and how they should be connected in this email from which I also linked a confusing overview of the pipeline. This overview has recently been improved to be even more confusing and the current version of the lower half (the native part) now looks like this:

native reduced

Solid arrows represent a flow of binary packages, dottend arrows represent a flow of source packages, ovals represent a set of packages, boxes with rounded corners represent set operations, rectangular boxes represent filters. There is only one input to a filter, which is the arrow connected to the top of the box. Outgoing arrows from the bottom represent the filtered input. Ingoing arrows to either side are arguments to the filter and control how the filter behaves depending on the algorithm. I will explain this better once the pipeline proved to be less in flux. The pipeline is currently executed like this by native.sh.

Two new ways to break dependency cycles have been discovered

So far, we knew of three ways to formally break dependency cycles:

  • remove build dependencies through build profiles
  • find out that the build dependency is only used to build arch:all packages and therefor put it into Build-Depends-Indep
  • cross compile some source packages

Two new methods can be added to the above:

Dependency graph definition changed

Back in September when I was visiting irill, Pietro found a flaw in how the dependency graph used to be generated. He supplied a new definition of the dependency graph which does away with the problem he found. After fixing some small issues with his code, I changed all the existing algorithms to use the new graph definition. The old graph is as of today removed from the repository. Thanks to Pietro for supplying the new graph definition - I must still admit that my OCaml foo is not strong enough to have come up with his code.

Added complexity for profile built source packages

As mentioned in the introduction, wookey and me addressed the debian-devel list with a proposal on necessary changes for an automated bootstrapping of Debian. During the discussiong, two important things came up which are to be considered by the dependency graph algorithms:

  • profile built source packages may not create all binary packages
  • profile built source packages may need additional build dependencies

Both things make it necessary to alter the dependency graph during the generation of a feedback arc set and feedback vertex set beyond the simple removal of edges. Luckily, the developed approximation algorithms can be extended to support such changes in the graph.

Different feedback arc set algorithms

The initially developed feedback arc set algorithm is well suited to discover build dependency edges which should be dropped. It performs far worse when creating the final build order because it only considers edges by itself and not how many edges of a source package can be dropped by profile building it.

The adjusted algorithm for generating a build order is more of a feedback vertex set algorithm because instead of greedily finding the edge with most cycles through it, it greedily finds the source package which would break most cycles if it was profile built.

Generating a build order with less profile built source packages

After implementing all the features above I now feel more confident to publish the current status of the tools to a wider audience.

The following test shows a run of the aforementioned ./native.sh shell script. Its final output is a list of source packages which have to be profile built and a build order. Using the resulting build order, starting from a minimal build system (essential:yes, build-essential and debhelper), all source packages will be compiled which are needed to compile all binary packages in the system. The result will therefor be a list of source and binary packages which fulfill the following property:

  • all binary packages can be built from the available source packages
  • all source packages can be built with the available binary packages

I called this a "reduceded distribution" in earlier posts. The interesting property of this specific selection is, that it contains the biggest problem set of Debian when bootstrapping it: a 900 to 1000 nodes big strongly connected component. Here is a visualization of the problem:

hideous mess

Source packages do not yet come with build profiles and the cross build situation can not yet be analyzed, so the following assumptions were made:

The last point is about 14 build dependencies which were decided to be broken by the feedback arc set algorithm but for which other data sources did not indicate that they are actually breakable. Those 14 are dynamically generated by native.sh.

If above assumptions should not be too far from the actual situation, then not more than 73 source packages have to be modified to bootstrap a reduced distribution. This reduced distribution even includes dependency-wise "big" packages like webkit, metacity, iceweasel, network-manager, tracker, gnome-panel, evolution-data-server, kde-runtime, libav and nautilus. By changing one line in native.sh one could easily develop a build order which generates gnome-desktop or really any given (meta-)package selection.

All of native.sh takes only 80 seconds to execute on my system (Core i5, 2.5 GHz, singlethreaded). Here is the final build order which creates 2044 binary packages from 613 source packages.

  1. nspr, libio-pty-perl, libmcrypt, unzip, libdbi-perl, cdparanoia, libelf, c-ares, liblocale-gettext-perl, libibverbs, numactl, ilmbase, tbb, check, libogg, libatomic-ops, libnl($), orc($), libaio, tcl8.4, kmod, libgsm, lame, opencore-amr, tcl8.5, exuberant-ctags, mhash, libtext-iconv-perl, libutempter, pciutils, gperf, hspell, recode, tcp-wrappers, fdupes, chrpath, libbsd, zip, procps, wireless-tools, cpufrequtils, ed, libjpeg8, hesiod, pax, less, dietlibc, netkit-telnet, psmisc, docbook-to-man, libhtml-parser-perl, libonig, opensp($), libterm-size-perl, linux86, libxmltok, db-defaults, java-common, sharutils, libgpg-error, hardening-wrapper, cvsps, p11-kit, libyaml, diffstat, m4
  2. openexr, enca, help2man, speex, libvorbis, libid3tag, patch, openjade1.3, openjade, expat, fakeroot, libgcrypt11($), ustr, sysvinit, netcat, libirman, html2text, libmad, pth, clucene-core, libdaemon($), texinfo, popt, net-tools, tar, libsigsegv, gmp, patchutils($), dirac, cunit, bridge-utils, expect, libgc, nettle, elfutils, jade, bison
  3. sed, indent, findutils, fastjar, cpio, chicken, bzip2, aspell, realpath, dctrl-tools, rsync, ctdb, pkg-config($), libarchive, gpgme1.0, exempi, pump, re2c, klibc, gzip, gawk, flex-old, original-awk, mawk, libtasn1-3($), flex($), libtool
  4. libcap2, mksh, readline6, libcdio, libpipeline, libcroco, schroedinger, desktop-file-utils, eina, fribidi, libusb, binfmt-support, silgraphite2.0, atk1.0($), perl, gnutls26($), netcat-openbsd, ossp-uuid, gsl, libnfnetlink, sg3-utils, jbigkit, lua5.1, unixodbc, sqlite, wayland, radvd, open-iscsi, libpcap, linux-atm, gdbm, id3lib3.8.3, vo-aacenc, vo-amrwbenc, fam, faad2, hunspell, dpkg, tslib, libart-lgpl, libidl, dh-exec, giflib, openslp-dfsg, ppl($), xutils-dev, blcr, bc, time, libdatrie, libpthread-stubs, guile-1.8, libev, attr, libsigc++-2.0, pixman, libpng, libssh2, sqlite3, acpica-unix, acl, a52dec
  5. json-c, cloog-ppl, libverto, glibmm2.4, rtmpdump, libdbd-sqlite3-perl, nss, freetds, slang2, libpciaccess, iptables, e2fsprogs, libnetfilter-conntrack, bash, libthai, python2.6($), libice($), libpaper, libfontenc($), libxau, libxdmcp($), openldap($), cyrus-sasl2($), openssl, python2.7($)
  6. psutils, libevent, stunnel4, libnet-ssleay-perl, libffi, readline5, file, libvoikko, gamin, libieee1284, build-essential, libcap-ng($), lcms($), keyutils, libxml2, libxml++2.6, gcc-4.6($), binutils, dbus($), libsm($), libxslt, doxygen($)
  7. liblqr, rarian, xmlto, policykit-1($), libxml-parser-perl, tdb, devscripts, eet, libasyncns, libusbx, icu, linux, libmng, shadow, xmlstarlet, tidy, gavl, flac, dbus-glib($), libxcb, apr, krb5, alsa-lib
  8. usbutils, enchant, neon27, uw-imap, libsndfile, yasm, xcb-util, gconf($), shared-mime-info($), audiofile, ijs($), jbig2dec, libx11($)
  9. esound, freetype, libxkbfile, xvidcore, xcb-util-image, udev($), libxfixes, libxext($), libxt, libxrender
  10. tk8.4, startup-notification, libatasmart, fontconfig, libxp, libdmx, libvdpau, libdrm, libxres, directfb, libxv, libxxf86dga, libxxf86vm, pcsc-lite, libxss, libxcomposite, libxcursor, libxdamage, libxi($), libxinerama, libxrandr, libxmu($), libxpm, libxfont($)
  11. xfonts-utils, libxvmc, xauth, libxtst($), libxaw($)
  12. pmake, corosync, x11-xserver-utils, x11-xkb-utils, coreutils, xft, nas, cairo
  13. libedit, cairomm, tk8.5, openais, pango1.0($)
  14. ocaml, blt, ruby1.8, firebird2.5, heimdal, lvm2($)
  15. cvs($), python-stdlib-extensions, parted, llvm-2.9, ruby1.9.1, findlib, qt4-x11($)
  16. xen, mesa, audit($), avahi($)
  17. x11-utils, xorg-server, freeglut, libva
  18. jasper, tiff3, python3.2($)
  19. openjpeg, qt-assistant-compat, v4l-utils, qca2, jinja2, markupsafe, lcms2, sip4, imlib2, netpbm-free, cracklib2, cups($), postgresql-9.1
  20. py3cairo, pycairo, pam, libgnomecups, gobject-introspection($)
  21. gdk-pixbuf, libgnomeprint, gnome-menus, gsettings-desktop-schemas, pangomm, consolekit, vala-0.16($), colord($), atkmm1.6
  22. libgee, gtk+3.0($), gtk+2.0($)
  23. gtkmm2.4, poppler($), openssh, libglade2, libiodbc2, gcr($), libwmf, systemd($), gcj-4.7, java-atk-wrapper($)
  24. torque, ecj($), vala-0.14, gnome-keyring($), gcc-defaults, gnome-vfs($)
  25. libidn, libgnome-keyring($), openmpi
  26. mpi-defaults, dnsmasq, wget, lynx-cur, ghostscript, curl
  27. fftw3, gnupg, libquvi, xmlrpc-c, raptor, liboauth, groff, fftw, boost1.49, apt
  28. boost-defaults, libsamplerate, cmake, python-apt
  29. qjson, qtzeitgeist, libssh, qimageblitz, pkg-kde-tools, libical, dwarves-dfsg, automoc, attica, yajl, source-highlight, pygobject, pygobject-2, mysql-5.5($)
  30. libdbd-mysql-perl, polkit-qt-1, libdbusmenu-qt, raptor2, dbus-python, apr-util
  31. rasqal, serf, subversion($), apache2
  32. git, redland
  33. xz-utils, util-linux, rpm, man-db, make-dfsg, libvisual, cryptsetup, libgd2, gstreamer0.10
  34. mscgen, texlive-bin
  35. dvipng, luatex
  36. libconfig, transfig, augeas, blas, libcaca, autogen, libdbi, linuxdoc-tools, gdb, gpm
  37. ncurses, python-numpy($), rrdtool, w3m, iproute, gcc-4.7, libtheora($), gcc-4.4
  38. libraw1394, base-passwd, lm-sensors, netcf, eglibc, gst-plugins-base0.10
  39. libiec61883, qtwebkit, libvirt, libdc1394-22, net-snmp, jack-audio-connection-kit($), bluez
  40. redhat-cluster, gvfs($), pulseaudio($)
  41. phonon, libsdl1.2, openjdk-6($)
  42. phonon-backend-gstreamer($), gettext, libbluray, db, swi-prolog, qdbm, swig2.0($)
  43. highlight, libselinux, talloc, libhdate, libftdi, libplist, python-qt4, libprelude, libsemanage, php5
  44. samba, usbmuxd, libvpx, lirc, bsdmainutils, libiptcdata, libgtop2, libgsf, telepathy-glib, libwnck3, libnotify, libunique3, gnome-desktop3, gmime, glib2.0, json-glib, libgnomecanvas, libcanberra, orbit2, udisks, d-conf, libgusb
  45. libgnomeprintui, libimobiledevice, nautilus($), libbonobo, librsvg
  46. evas, wxwidgets2.8, upower, gnome-disk-utility, libgnome
  47. ecore, libbonoboui
  48. libgnomeui
  49. graphviz
  50. exiv2, libexif, lapack, soprano, libnl3, dbus-c++
  51. atlas, libffado, graphicsmagick, libgphoto2, network-manager
  52. pygtk, jackd2, sane-backends, djvulibre
  53. libav($), gpac($), ntrack, python-imaging, imagemagick, dia
  54. x264($), matplotlib, iceweasel
  55. strigi, opencv, libproxy($), ffms2
  56. kde4libs, frei0r, glib-networking
  57. kde-baseapps, kate, libsoup2.4
  58. geoclue, kde-runtime, totem-pl-parser, libgweather, librest, libgdata
  59. webkit
  60. zenity, gnome-online-accounts
  61. metacity, evolution-data-server
  62. gnome-panel
  63. tracker

The final recompilation of profile built source packages is omitted. Source packages marked with a ($) are selected to be profile built. All source packages listed in the same line can be built in parallel as they do not depend upon each other.

This order looks convincing as it first compiles a multitude of source packages which have no or only few build dependencies lacking. Later steps allow fewer source packages to be compiled in parallel. The amount of needed build dependencies is highest in the source packages that are built last.

View Comments

Does it become harder to bootstrap Debian?

categories: debian

My last post explained how I retrieved and corrected data from snapshot.debian.org so that dose3 was able to parse it. In this post I will cover some surprising results I found when using my tools on those Packages and Sources files from 2005 until today.

For each pair of Packages and Sources files I did the following:

  1. created a reduced distribution
  2. calculated the dependency graph

I call a reduced distribution the smallest set of binary and source packages with the following properties:

  • all source packages can be built with the available binary packages
  • all binary packages are built from the available source packages

Creating a reduced distribution first, greatly increases the execution speed of my algorithms as it reduces the amount of binary and source packages by an order of magnitude while still preserving the dependency cycle situation of the core packages. In many cases, once the packages of a reduced distribution are available, all the rest of Debian can be compiled from them without any dependency cycles.

As also mentioned in earlier posts, there is always one central, big strongly connected component (SCC) in the dependency graph.

I am especially interested in how the size of the reduced distribution and the SCC change over time as both are an indication of:

  • the amount of interdependencies between core packages
  • the amount of dependency cycles in the dependency graph

Lets look at the plots I did from the data I gathered. The gray data points indicate that at that point in time, one or more of the core source packages (the ones in the reduced distribution) in Debian Sid was not compilable. This means that the resulting values cannot be fully trusted. But as it is mostly only a single source package that doesnt compile, it doesnt influence the overall result much and therefor I included them anyways. Red and green data points represent a fully successful run.

The only thing that I do not yet understand is what happened in 2007...

So while a potential porter in 2005 only had to look at a graph of 150 nodes, he now needs to solve a graph of nearly 1000 nodes. The amount of edges in the dependency graph grew even more dramatic from about 500 to over 8000 edges.

While the dependency situation for Debian Sid in 2005 can easily be printed using xdot and visually solved, this in not possible anymore in 2012.

While dependencies of only a few dozen source packages had to manually be dropped in 2005, now even dropping build dependencies from a few hundred source packages doesnt solve the dependency situation.

So my assumption is, that due to a growing amount of interdependencies between source and binary packages (as both gain more features), bootstrapping Debian for a new architecture becomes harder over time. Is this also the perceived subjective impression of people that ported Debian in the past?

If my assumption is correct, then there is a growing need for official support of droppable build dependencies (or "stage builds" or "profile builds") to break dependency cycles during the bootstrapping process. Work of a porter would be much easier if source packages would already contain information about what build dependencies can be dropped (if so needed). In the best case, a machine could use those annotations to calculate a build order automatically.

As one can see in the graph above, there are currently 370 source packages in the main SCC. This means that no more than this amount of packages (but probably much less) have to be annotated to break the SCC into a directed acyclic graph.

Discussion about what syntax to use to mark potentially droppable build dependencies currently happens in bug#661538 but should maybe be discussed by a wider audience. The currently favored solution was proposed in said bugreport by Guillem Jover and is called "build profiles". It has the advantage that it is not only trivial to implement (a patch exist for dpkg and dose3 already supports them) but would also be useful for other purposes like embedded builds. The format is similar to how architecture restrictions for individual dependencies are specified but uses "triangular brackets":

Build-Depends: huge (>= 1.0) [i386 arm] <!embedded !bootstrap>, tiny

The work Patrick McDermott did for his GSoC project over the summer already uses above syntax.

View Comments

analyzing Packages and Sources from snapshot.debian.org

categories: debian

When I wanted to use my dependency graph analysis tools to analyze earlier states of Debian Sid, I naturally used snapshot.debian.org to retrieve the Packages and Sources files from which my tools retrieve the dependency information.

The problem is, that many of those Packages and Sources files contain syntax errors that make the dose3 parser choke. This leads to my tools being unable to parse the affected files.

The following script does not only download all Packages and Sources files in a five day interval (4460 MB from 2005/03/12 to 2012/10/11) but also cleans all the syntax errors that were not parsable by dose3. This includes invalid version naming, architecture lists separated by commas, disjunctions in Conflict fields and incorrect braces/bracket usage.

Maybe this helps others who also want to profit from Packages and Sources files from the past.

Fun fact #1: starting from June 2010, there were no more syntax errors in the Packages and Sources files of Debian Sid.

Fun fact #2: starting from December 2009, there are no more mismatches between versions of binary packages in the Packages file and the versions of the corresponding source packages in the Sources file.

View Comments

Using Gentoo to find reduced build dependencies for Debian source packages

categories: debian

Automatically devising a build order that allows to bootstrap Debian, currently fails (amongst other reasons) because of the lack of metadata information about which build dependencies can potentially be dropped from source packages. If that information was available, an algorithm could decide which build dependencies to drop so that dependency cycles can be broken.

Finding droppable build dependencies of a source package is something only humans can do. This is because it involves to manually analyze and test the build system of a source package. Build systems are neither uniform nor do they encode their dependencies in a way that can directly be mapped to Debian packages. Therefor they are not machine readable.

One idea to solve the dilemma, is to find a Linux distribution that provides the following:

  • allows to do "profile builds" of its source packages with different features enabled or disabled
  • stores information about which feature requires which build dependency
  • stores everything in a format that can be parsed and analyzed
  • covers a similar range of software packages as Debian does

If such a distribution can be found then the information from it can be used to find dependencies that can also be dropped from Debian source packages.

Gentoo is a distribution that fulfills above requirements through so called USE flags that allow to enable or disable features during compilation. Dependencies of Gentoo source packages are stored in .ebuild files that control the build process. Since .ebuild files are bash scripts, parsing them is not trivial. I therefor used the emerge software package to extract that information. Thanks to the well written emerge code and to quick help in the Gentoo IRC channel, it didnt take long to make the code run on Debian. My sourcecode is downloadable here:

https://gitorious.org/debian-bootstrap/gen2deb

Before I list the results of using Gentoo USE flags to determine dependencies that can potentially be dropped from Debian source packages, let me list the problems that this method entails.

Only package name matching, no version matching

When writing the mapping from Debian to Gentoo packages and back I discard version information. There are just too many versions that either Debian or Gentoo have and are not present in the other. So the assumption is, that Debian Sid and Gentoo have both the most recent major versions of upstream software which has roughly the same requirements in terms of build dependencies.

Gentoo packages are matched to Debian source packages

In Gentoo there are only source packages and no binary packages. So I map Gentoo packages to Debian source packages. But Gentoo source packages build depend on other source packages while Debian source packages depend on binary packages. So at some point I have to translate Gentoo packages to Debian source packages and those source packages to Debian binary packages. I do this by analyzing the original binary package build dependencies of a Debian source package and then filter out those binary packages as being droppable that are built by the Debian source packages that were found to be droppable.

Not the exact same package set

There is some software that is only in Gentoo and some that is only in Debian. Debian and Gentoo also split some source packages differently.

Gentoo has more direct dependencies

Many build dependencies in Debian are indirectly pulled in through dependencies of direct build dependencies. In Gentoo source packages directly depend on most things they need to build successfully. This leads to the list of dependencies in Gentoo to be much larger than the list of dependencies in the corresponding Debian source package. It also means that lots of dependencies that can be dropped in Gentoo are not found to be droppable in Debian because they are not direct dependencies of that source package.

There are no implicit dependencies

Gentoo will often drop dependencies that are essential or build-essential packages in Debian and are therefor implicit build dependencies that cannot be dropped.

Result

Despite the many problems, the result doesnt look too wrong. I got some Debian source packages that were found to have droppable build dependencies from Thorsten Glaser and all dependencies that Gentoo found to be droppable were also dropped by him.

To put everything into numbers: the current 912 nodes big SCC in Debian Sid can be reduced to 6 individual SCC with 422, 5, 5, 3, 2 and 2 nodes each. So using Gentoo cuts the size of the central component to more than half.

Surely, there will be a number of dependencies that were found to be droppable in Gentoo but are actually not droppable in Debian. The point is, that it is better to have "some" data even if it contains false positives than no data at all. It is easier for a human to verify if some suggested droppable build dependencies are actually correct than going through hundreds of source packages with thousands of dependencies manually.

View Comments
« Older Entries -- Newer Entries »