Does it become harder to bootstrap Debian?
Sat, 13 Oct 2012 23:46 categories: debian
My last post explained how I retrieved and corrected data from snapshot.debian.org so that dose3 was able to parse it. In this post I will cover some surprising results I found when using my tools on those Packages and Sources files from 2005 until today.
For each pair of Packages and Sources files I did the following:
- created a reduced distribution
- calculated the dependency graph
I call a reduced distribution the smallest set of binary and source packages with the following properties:
- all source packages can be built with the available binary packages
- all binary packages are built from the available source packages
Creating a reduced distribution first greatly increases the execution speed of my algorithms, as it reduces the number of binary and source packages by an order of magnitude while still preserving the dependency cycle situation of the core packages. In many cases, once the packages of a reduced distribution are available, all the rest of Debian can be compiled from them without any dependency cycles.
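A minimal sketch of how such a closure could be computed, assuming one starts from a seed of binary packages (the seed, the function name and the toy package names are assumptions made for illustration, not taken from my tools). It also assumes that build_depends already lists every binary package needed to build a source package, which glosses over computing installation sets.

```python
def reduced_distribution(seed, built_from, build_depends):
    """Close a seed set of binary packages under the two properties.

    seed          -- hypothetical starting set of binary package names
    built_from    -- dict: binary package -> source package that builds it
    build_depends -- dict: source package -> set of binary packages it needs
    """
    binaries, sources = set(seed), set()
    todo = list(seed)
    while todo:
        binary = todo.pop()
        source = built_from[binary]
        if source in sources:
            continue  # this source package was already pulled in
        sources.add(source)
        # Every binary package needed to build the source must be available.
        for dep in build_depends[source]:
            if dep not in binaries:
                binaries.add(dep)
                todo.append(dep)
    return binaries, sources


# Toy data: "bin-d" is not needed by anything reachable from the seed.
built_from = {"bin-a": "src-a", "bin-b": "src-b", "bin-c": "src-c", "bin-d": "src-d"}
build_depends = {
    "src-a": {"bin-a", "bin-b", "bin-c"},
    "src-b": {"bin-a", "bin-c"},
    "src-c": {"bin-a", "bin-b"},
    "src-d": {"bin-a", "bin-d"},
}
binaries, sources = reduced_distribution({"bin-a"}, built_from, build_depends)
```

The loop stops once no new package is added, so the result is the smallest set that contains the seed and satisfies both properties.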
As also mentioned in earlier posts, there is always one central, big strongly connected component (SCC) in the dependency graph.
I am especially interested in how the size of the reduced distribution and the SCC change over time as both are an indication of:
- the amount of interdependencies between core packages
- the amount of dependency cycles in the dependency graph
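For readers unfamiliar with strongly connected components, here is a small sketch of Tarjan's algorithm on a toy graph. The node naming ("src:" and "bin:" prefixes) and the edge direction (source package to the binary packages it build-depends on, binary package to the source package that builds it) are assumptions for illustration. The recursion keeps the code short; a graph with a thousand nodes may need an iterative variant or a raised recursion limit.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps each node to an iterable of successors."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, ()):
            if succ not in index:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        # node is the root of a component: pop the component off the stack
        if lowlink[node] == index[node]:
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return components


toy_graph = {
    "src:a": ["bin:b"],
    "bin:b": ["src:b"],
    "src:b": ["bin:a"],
    "bin:a": ["src:a"],
    "src:c": ["bin:a"],
    "bin:c": ["src:c"],
}
main_scc = max(strongly_connected_components(toy_graph), key=len)
```

In this toy graph, src:c and bin:c depend on the cycle but are not part of it, so they end up in components of size one.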
Let's look at the plots I made from the data I gathered. The gray data points indicate that at that point in time, one or more of the core source packages (the ones in the reduced distribution) in Debian Sid was not compilable. This means that the resulting values cannot be fully trusted. But as it is mostly only a single source package that doesn't compile, it doesn't influence the overall result much, and therefore I included them anyway. Red and green data points represent a fully successful run.
The only thing that I do not yet understand is what happened in 2007...
So while a potential porter in 2005 only had to look at a graph of 150 nodes, he now needs to solve a graph of nearly 1000 nodes. The number of edges in the dependency graph grew even more dramatically, from about 500 to over 8000 edges.
While the dependency situation for Debian Sid in 2005 can easily be printed using xdot and solved visually, this is no longer possible in 2012.
While the build dependencies of only a few dozen source packages had to be dropped manually in 2005, now even dropping build dependencies from a few hundred source packages doesn't solve the dependency situation.
So my assumption is that, due to a growing number of interdependencies between source and binary packages (as both gain more features), bootstrapping Debian for a new architecture becomes harder over time. Is this also the subjective impression of people who ported Debian in the past?
If my assumption is correct, then there is a growing need for official support of droppable build dependencies (or "stage builds" or "profile builds") to break dependency cycles during the bootstrapping process. A porter's work would be much easier if source packages already contained information about which build dependencies can be dropped (if needed). In the best case, a machine could use those annotations to calculate a build order automatically.
As one can see in the graph above, there are currently 370 source packages in the main SCC. This means that no more than this number of packages (but probably far fewer) have to be annotated to break the SCC into a directed acyclic graph.
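To illustrate how a machine could turn such annotations into a build order, here is a minimal sketch using graphlib from the Python standard library (Python 3.9 or later). It assumes, for simplicity, that the graph has been collapsed so that source packages depend directly on other source packages, and that droppable dependencies are given as (package, dependency) pairs. The function name and the toy data are made up for this example.

```python
from graphlib import CycleError, TopologicalSorter


def build_order(depends, droppable):
    """Drop the annotated build dependencies, then sort topologically.

    depends   -- dict: source package -> set of source packages it needs
    droppable -- set of (package, dependency) pairs that may be dropped
    Raises graphlib.CycleError if cycles remain after dropping.
    """
    reduced = {
        pkg: {dep for dep in deps if (pkg, dep) not in droppable}
        for pkg, deps in depends.items()
    }
    # static_order() yields every package after all of its dependencies.
    return list(TopologicalSorter(reduced).static_order())


# Toy cycle a -> b -> c -> a, broken by dropping the dependency of c on a.
depends = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
order = build_order(depends, {("c", "a")})
```

A package that was built with a dependency dropped would still have to be rebuilt in full once that dependency becomes available; the sketch leaves that second pass out.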
Discussion about what syntax to use to mark potentially droppable build dependencies currently happens in bug#661538 but should maybe be discussed by a wider audience. The currently favored solution was proposed in said bug report by Guillem Jover and is called "build profiles". It has the advantage that it is not only trivial to implement (a patch exists for dpkg and dose3 already supports them) but would also be useful for other purposes like embedded builds. The format is similar to how architecture restrictions for individual dependencies are specified but uses angle brackets:
Build-Depends: huge (>= 1.0) [i386 arm] <!embedded !bootstrap>, tiny
The work Patrick McDermott did for his GSoC project over the summer already uses the above syntax.
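To show that the bracket syntax is indeed easy to process, here is a small sketch of a parser for a Build-Depends value like the one above. It is not the dpkg or dose3 implementation; the function name and the output layout are made up, and alternatives separated by "|" are deliberately not handled.

```python
import re

DEP_RE = re.compile(
    r"""^\s*(?P<name>[a-z0-9][a-z0-9+.\-]*)   # package name
        (?:\s*\((?P<version>[^)]*)\))?        # optional version constraint
        (?:\s*\[(?P<arches>[^\]]*)\])?        # optional architecture list
        (?:\s*<(?P<profiles>[^>]*)>)?         # optional profile list
        \s*$""",
    re.VERBOSE,
)


def parse_build_depends(field):
    """Split a Build-Depends value into one dict per dependency."""
    deps = []
    for item in field.split(","):
        match = DEP_RE.match(item)
        if match is None:
            raise ValueError(f"cannot parse dependency: {item!r}")
        deps.append({
            "name": match["name"],
            "version": match["version"],
            "arches": (match["arches"] or "").split(),
            "profiles": (match["profiles"] or "").split(),
        })
    return deps


parsed = parse_build_depends(
    "huge (>= 1.0) [i386 arm] <!embedded !bootstrap>, tiny"
)
```

Each dependency becomes a dict with its name, version constraint, architecture list and profile list, so a tool can decide per dependency whether it applies.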
analyzing Packages and Sources from snapshot.debian.org
Fri, 12 Oct 2012 21:11 categories: debian
When I wanted to use my dependency graph analysis tools to analyze earlier states of Debian Sid, I naturally used snapshot.debian.org to retrieve the Packages and Sources files from which my tools retrieve the dependency information.
The problem is that many of those Packages and Sources files contain syntax errors that make the dose3 parser choke. This leaves my tools unable to parse the affected files.
The following script not only downloads all Packages and Sources files in a five day interval (4460 MB from 2005/03/12 to 2012/10/11) but also cleans up all the syntax errors that dose3 could not parse. This includes invalid version naming, architecture lists separated by commas, disjunctions in Conflicts fields and incorrect brace/bracket usage.
Maybe this helps others who also want to profit from Packages and Sources files from the past.
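To give a flavour of what such a cleanup involves, here is a minimal sketch of just one of the fixes listed above: turning comma-separated architecture lists into space-separated ones. It is an illustration only, not an excerpt from the script, and fix_arch_lists is a made-up name.

```python
import re

# Anything between square brackets is treated as an architecture list.
ARCH_LIST_RE = re.compile(r"\[([^\]]*)\]")


def fix_arch_lists(line):
    """Rewrite '[i386, amd64]' as '[i386 amd64]', leaving other commas alone."""
    def normalise(match):
        arches = match.group(1).replace(",", " ").split()
        return "[" + " ".join(arches) + "]"
    return ARCH_LIST_RE.sub(normalise, line)


fixed = fix_arch_lists("Build-Depends: foo [i386, amd64], bar")
```

Only commas inside square brackets are touched, so the commas that separate dependencies stay in place.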
Fun fact #1: starting from June 2010, there were no more syntax errors in the Packages and Sources files of Debian Sid.
Fun fact #2: starting from December 2009, there are no more mismatches between versions of binary packages in the Packages file and the versions of the corresponding source packages in the Sources file.
Using Gentoo to find reduced build dependencies for Debian source packages
Wed, 10 Oct 2012 23:56 categories: debian
Automatically devising a build order that allows bootstrapping Debian currently fails (among other reasons) because of the lack of metadata about which build dependencies can potentially be dropped from source packages. If that information were available, an algorithm could decide which build dependencies to drop so that dependency cycles can be broken.
Finding droppable build dependencies of a source package is something only humans can do. This is because it involves manually analyzing and testing the build system of a source package. Build systems are neither uniform nor do they encode their dependencies in a way that can directly be mapped to Debian packages. Therefore they are not machine readable.
One idea to solve the dilemma is to find a Linux distribution that provides the following:
- allows "profile builds" of its source packages with different features enabled or disabled
- stores information about which feature requires which build dependency
- stores everything in a format that can be parsed and analyzed
- covers a similar range of software packages as Debian does
If such a distribution can be found then the information from it can be used to find dependencies that can also be dropped from Debian source packages.
Gentoo is a distribution that fulfills the above requirements through so-called USE flags that allow features to be enabled or disabled during compilation. Dependencies of Gentoo source packages are stored in .ebuild files that control the build process. Since .ebuild files are bash scripts, parsing them is not trivial. I therefore used the emerge software package to extract that information. Thanks to the well-written emerge code and to quick help in the Gentoo IRC channel, it didn't take long to make the code run on Debian. My source code is downloadable here:
https://gitorious.org/debian-bootstrap/gen2deb
Before I list the results of using Gentoo USE flags to determine dependencies that can potentially be dropped from Debian source packages, let me list the problems that this method entails.
Only package name matching, no version matching
When writing the mapping from Debian to Gentoo packages and back I discard version information. There are just too many versions that either Debian or Gentoo has and that are not present in the other. So the assumption is that Debian Sid and Gentoo both have the most recent major versions of upstream software, which have roughly the same requirements in terms of build dependencies.
Gentoo packages are matched to Debian source packages
In Gentoo there are only source packages and no binary packages. So I map Gentoo packages to Debian source packages. But Gentoo source packages build depend on other source packages while Debian source packages depend on binary packages. So at some point I have to translate Gentoo packages to Debian source packages and those source packages to Debian binary packages. I do this by taking the original binary build dependencies of a Debian source package and marking as droppable those binary packages that are built by Debian source packages found to be droppable.
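The last step can be sketched in a few lines. The function and package names below are hypothetical and only serve to show the idea; the real mapping lives in the gen2deb code linked above.

```python
def droppable_binary_deps(build_depends, built_from, droppable_sources):
    """Return the binary build dependencies built by a droppable source package.

    build_depends     -- binary packages a Debian source package build-depends on
    built_from        -- dict: binary package -> Debian source package building it
    droppable_sources -- Debian source packages that the Gentoo data marks as
                         droppable for this package
    """
    return {
        binary for binary in build_depends
        if built_from.get(binary) in droppable_sources
    }


# Toy data: the source package "libfoo" was found to be droppable.
built_from = {"libfoo-dev": "libfoo", "libbar-dev": "libbar", "foo-doc": "libfoo"}
result = droppable_binary_deps(
    {"libfoo-dev", "libbar-dev"}, built_from, {"libfoo"}
)
```

Binary packages that are built by a droppable source package but are not build dependencies in the first place (foo-doc in the toy data) do not show up in the result.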
Not the exact same package set
There is some software that is only in Gentoo and some that is only in Debian. Debian and Gentoo also split some source packages differently.
Gentoo has more direct dependencies
Many build dependencies in Debian are indirectly pulled in through dependencies of direct build dependencies. In Gentoo, source packages directly depend on most things they need to build successfully. This makes the list of dependencies in Gentoo much larger than the list of dependencies in the corresponding Debian source package. It also means that lots of dependencies that can be dropped in Gentoo are not found to be droppable in Debian because they are not direct dependencies of that source package.
There are no implicit dependencies
Gentoo will often drop dependencies that are essential or build-essential packages in Debian and are therefore implicit build dependencies that cannot be dropped.
Result
Despite the many problems, the result doesn't look too wrong. Thorsten Glaser gave me a list of Debian source packages that he found to have droppable build dependencies, and all dependencies that Gentoo found to be droppable were also dropped by him.
To put everything into numbers: the current 912-node SCC in Debian Sid can be reduced to 6 individual SCCs with 422, 5, 5, 3, 2 and 2 nodes. So using Gentoo cuts the size of the central component to less than half.
Surely, there will be a number of dependencies that were found to be droppable in Gentoo but are actually not droppable in Debian. The point is that it is better to have "some" data, even if it contains false positives, than no data at all. It is easier for a human to verify whether some suggested droppable build dependencies are actually correct than to go through hundreds of source packages with thousands of dependencies manually.
visit at Paris IRILL
Fri, 28 Sep 2012 10:44 categories: debian
Last week, I was invited to give a talk about my Debian bootstrapping efforts at IRILL in Paris. The slides of my talk are online as pdf.
The time I spent in Paris with Pietro Abate was very fruitful. I have to thank him and Roberto di Cosmo for inviting me and even compensating for my travel expenses.
The things we actually managed to implement during my visit:
- removed huge chunks of code that were not needed anymore, making everything more concise and pretty: ended up removing over 1600 lines
- basebuildsystem.ml can now fill add-cross-sources.list with sources for debhelper (as debhelper is quite build-essential)
- start evaluating Gentoo as a source for reduced build dependencies
- add unit test skeleton and material
- compile with dose3 master
- add graphML output
- feed graphs into analysis tools for visualization
The two most important things (in my opinion) that we came up with for future implementation were the idea to harvest reduced build dependency information from Gentoo and the discovery of a flaw in the way the current dependency graph relates to binary packages and their installation sets.
I am still busy with evaluating output from my trials with Gentoo, so I will cover this topic in a later blog post once I have measured the actual impact on the dependency graph. The current status is that Gentoo USE flags allow me to find possibly droppable build dependencies for 250 out of the 350 interesting Debian source packages that are part of the main SCC.
Pietro also found a current flaw in how the dependency graph is generated. While source package A and B might both depend on binary package C (and its installation set), it is wrong to add a dependency to C from both source packages without further verification. Due to virtual packages and disjunctive dependencies, C might have many possible installation sets. Only one of them is chosen in the current code. The problem is that this one chosen set might conflict with the other build dependencies of source packages depending on C. Therefore, there must exist multiple binary package nodes C, each with a different installation set, dynamically generated as they are needed. Source packages must point to the node for C that possesses an installation set that doesn't conflict with its own build dependencies.
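A rough sketch of the idea, under strong simplifications: the candidate installation sets of C are assumed to be given up front, conflicts are plain unordered pairs of package names, and only direct conflicts with the build dependencies are checked. All names are hypothetical; a real implementation would ask the solver for the installation sets.

```python
def conflicts_with(inst_set, build_deps, conflicts):
    """True if any package of inst_set conflicts with any build dependency."""
    return any(
        frozenset((a, b)) in conflicts for a in inst_set for b in build_deps
    )


def pick_installation_set(candidates, build_deps, conflicts):
    """Return the first candidate compatible with build_deps, or None."""
    for inst_set in candidates:
        if not conflicts_with(inst_set, build_deps, conflicts):
            return inst_set
    return None


# Toy data: C can be installed together with impl1 or with impl2.
candidates_c = [frozenset({"c", "impl1"}), frozenset({"c", "impl2"})]
conflicts = {frozenset({"impl1", "x"}), frozenset({"impl2", "y"})}
build_deps = {"A": {"c", "x"}, "B": {"c", "y"}}

# One graph node per (package, installation set) pair, created on demand.
nodes = {
    src: ("c", pick_installation_set(candidates_c, deps, conflicts))
    for src, deps in build_deps.items()
}
```

Here A and B both build-depend on C but end up pointing to two different nodes for C, because each installation set conflicts with one of their other build dependencies.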
Other TODO notes that we came up with and that I will be implementing are:
- integrating dose3 as a git submodule
- create a proper build system
- try using a different cudf solver
- unit tests
- finally coming up with a name (suggestions welcome - I'm bad at name-finding)
- building a Debian package (depends on having a name first)
- formalize/visualize/document the current algorithms
- to break the main scc, use additional heuristics like:
  - the order induced by reduced_dist to classify nodes in the graph
  - centrality/distance in graph
- comparing scc of different Debian snapshots with each other
As I cannot always depend on new dose3 versions being pushed to Debian Sid right after their release, I will do the dose3 git submodule integration over the weekend. This will allow me to evaluate the results I got from my evaluation of Gentoo USE flags that I gathered over the past week.
Bootstrappable Debian - How to help
Wed, 08 Aug 2012 09:03 categories: debian
TLDR: multiarch, multiarch, multiarch, cross buildability, staged build dependencies, wiki page, corrections/hints/requests to debian-bootstrap at lists.mister-muffin.de
This summer (and this year's GSoC) is nearing its end and to make it easier for people to make use of the information my tools produced so far, I created a page in the Debian wiki. It lists not only the open issues I see but also statistics that I gathered using the output of my GSoC project. I want to use this blog post to make people aware of that page as well as to get some feedback on it and anything related to it.
The biggest blocker my tools face is that many packages are still missing multiarch information. As long as at least the basic packages do not have their cross build dependencies satisfied via multiarch for an existing foreign architecture, automated tools cannot properly analyze the dependency situation in the bootstrapping case, when many packages of the new foreign architecture do not even exist yet.
If Debian is supposed to be bootstrappable, then the first stage is to make a set of basic packages cross compile for an existing foreign architecture. Once this is possible, a tool of mine can analyze the cyclic build dependency situation that might occur when cross compiling for an architecture that does not exist yet. Then, staged cross builds can be used to cross compile a minimal foreign system. Due to missing multiarch classification, it is not known yet how big the cyclic build dependency situation is for the base packages.
It is not only the conversion of packages to multiarch that is needed but also the adding of the :any (and in rare cases :native) qualifier to build dependencies on M-A: allowed packages. Prominent build dependencies that should be (but are not yet) M-A: allowed are python and gettext. Both are needed as a build dependency by many packages of the base system.
Unfortunately wanna-build does not understand qualifiers like :any and :native yet. Until it does, no package can be marked :any or :native and cross compilation of many base packages cannot succeed.
Once the point is reached where a base system can be cross compiled from nothing, native compilation can start. Since native compilation doesn't depend on multiarch, the dependency situation when trying to natively compile all of Debian from nothing is understood much better. Unfortunately, the cyclic build dependency situation is also much worse in the native case and there exists a big 1000-node strongly connected component of binary and source packages that all interdepend on each other.
This dependency mess can be solved using three approaches:
- using Build-Depends-Indep
- stage building
- cross compilation
The wiki page gives many hints on how to find packages that each method can be applied to.
Stage building is a tool that might be useful for cross building (we don't know for sure yet) but is definitely needed for native compilation. It is needed for native compilation because after all possible dependencies are moved to Build-Depends-Indep, the only alternative to stage building for breaking dependency cycles is to cross build source packages. Since building a package "staged", without one of its build dependencies, is often much easier than making the package in question cross compile, it is the preferred alternative. Once more packages have been made multiarch, it might be possible to prove that there is no alternative to introducing a notion of staged builds.
Some people (wookey, Patrick McDermott, Guillem Jover, myself) decided that the following format to mark staged build dependencies would be preferred over others:
Build-Depends: huge (>= 1.0) [i386 arm] <!embedded !bootstrap>, tiny
The <> format was proposed by Guillem Jover in bug#661538. Patches for dpkg and dose3 are done. More people need to discuss this format before a final decision on how to indicate staged build dependencies can be made.
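As a sketch of how a tool might evaluate such an annotated dependency, the following assumes that profile lists behave like architecture lists: a negated term excludes the dependency when the named value is active, and positive terms require one of them to be active. This reading is my assumption, since the semantics are still under discussion, and the function names are made up.

```python
def restriction_allows(terms, active):
    """Evaluate a list such as ['i386', 'arm'] or ['!embedded', '!bootstrap'].

    An empty list places no restriction on the dependency.
    """
    if not terms:
        return True
    negated = [t[1:] for t in terms if t.startswith("!")]
    positive = [t for t in terms if not t.startswith("!")]
    if any(name in active for name in negated):
        return False
    return not positive or any(name in active for name in positive)


def dependency_applies(arches, profiles, host_arch, active_profiles):
    """True if the dependency has to be installed for this build."""
    return (restriction_allows(arches, {host_arch})
            and restriction_allows(profiles, set(active_profiles)))


# The "huge" dependency from the example line above.
huge_arches = ["i386", "arm"]
huge_profiles = ["!embedded", "!bootstrap"]
normal = dependency_applies(huge_arches, huge_profiles, "i386", [])
staged = dependency_applies(huge_arches, huge_profiles, "i386", ["bootstrap"])
```

Under this reading, huge is required for a normal build on i386 but skipped as soon as the bootstrap profile is active, while tiny, which carries no restriction, is always required.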
For more information on the topic, have a look at the corresponding wiki page. Feel free to direct any comments/critique/hints to debian-bootstrap at lists.mister-muffin.de or directly to me.