I added the first version of basenocycles.ml to git. Given an initial set of cross built packages, it tries to compile as much as possible on the resulting system in multiple rounds.
On June 3, I discovered an error in my program that would only come up when using the Debian Sid package lists as the input:
Fatal error: exception Assert_failure("common/edosSolver.ml", 610, 11)
On this day, June 5, I wrote a minimal test case for this problem.
The same day, Pietro figured out that this is a bug in dose which will be fixed in the next release.
Begin writing code to figure out how important a binary package is for the further build process.
Try to use Depsolver.edos_install to find out what packages are needed to make debhelper available.
Restructure basenocycles.ml and exclude source packages that have already been built. There is still trouble with already existing binary packages and Cudf.mem_installed. Comment the code better.
I wrote some crude code (only estimate, not correct, fixed later) that would give a rough overview of how often a given binary package is directly required as a build dependency.
Debhelper came out as the most needed package. It is architecture:all, so it does not have to be built but it has unfulfilled runtime dependencies. To make those packages available, 13 (actually 11, fixed later) packages have to be compiled on Ubuntu Natty. But those packages all (except gettext) require debhelper itself to be built. The first dependency cycle.
This dependency cycle (actually, the 12 cycles) can be broken by either cross compiling those source packages or by making them build without debhelper. One goal of the program is to help decide what the easier option is, but this is not yet implemented.
To play around a bit, I added the possibility to specify a list of packages that are cross compiled in addition to the minimal set of cross compiled packages. I added the 13 packages found above to the list, thus making the binary packages they build available. This made debhelper available in the system.
As a result, 1625 out of 3339 source packages can be built with just a minimal build system (priority:essential packages plus build-essential) and debhelper available.
The next package that blocks the most other source packages from being built is cdbs. The next nine packages in that list also require cdbs so it seems to be the next important package to be available.
Pietro's suggestions make me:
- do not open BootstrapCommon but ExtLib, Common, Algo, Debian
- do proper option parsing and logging
- use Debcudf.ignore_essential = true
- do Debcudf.init_tables (binlist@srclist)
- use @ with shorter list first
- use more List.rev_append instead of @
- use CudfAdd.who_provides to find if a package is available
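The two list-related suggestions come down to how OCaml's standard library appends lists: `a @ b` copies its left operand, and `List.rev_append a b` is tail-recursive but reverses `a` on the way. A minimal sketch using only the standard library (the list contents are made up for illustration):

```ocaml
(* [a @ b] copies [a], so its cost grows with the length of the left operand. *)
let short = [1; 2; 3]
let long = List.init 100_000 (fun i -> i)

(* When the order of the result does not matter (e.g. a bag of packages),
   put the shorter list on the left. *)
let combined = short @ long

(* List.rev_append is tail-recursive; it prepends the reversed first
   argument to the second one. *)
let combined_rev = List.rev_append short long

let () =
  Printf.printf "%d %d\n" (List.length combined) (List.length combined_rev)
```

Both results contain the same elements; only the order of the first three differs.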
I try to figure out how to get a list of packages that are missing to make a package installable/buildable. This functionality should be provided in dose but I fail to find it.
Lacking a solution to the problem from June 7, I write a mail to Pietro.
I start my first graphs in ocaml using the ocamlgraph library.
The graph I generate starts off at a binary package. For each binary package, it connects new vertices for its runtime dependencies. If a binary package is not arch:all and also not yet otherwise compiled, its source package is also added.
The result is a graph in which the set of source packages it contains will make the initial package available if those source packages are cross compiled.
The graph is extended further than the source packages.
I refine a couple of functions, make univ_get_pkg_by_name return the package with the highest version number.
I wrote a rather lengthy (1027 words) email to the list that explains my status as of this day.
I can create graphviz dot files with ocaml, can create node and edge types and create the graph by an imperative pattern that I saw a lot in Pietro's code.
Disjunctions are not yet handled correctly (see mail from June 8).
The graphs generated look like the following: http://mister-muffin.de/p/8nyc.png
I write a test case which shows how CudfAdd.who_provides doesn't find virtual packages.
Automate the process of finding the packages that, if cross compiled, would make another package available.
Add more predicates (identifying source packages) and improve input file reading code.
Move build_compile_rounds, which compiles as many source packages as it can in multiple rounds on a given minimal system, to a toplevel function and thereby abstract it more.
Create a rudimentary text based menu to choose different actions to take for an analysis.
Start writing an extended version of simple_dependency_graph for deeper analysis.
Use xdot to show graphs from the text menu. Allow saving those graphs to a file.
Move functionality from the extended version of simple_dependency_graph over to the normal version and delete the extended version.
Add the new Closure vertex type.
Create extended_dependency_graph which is supposed to not contain single binary package vertices but handle a package and its installation set as one vertex.
The output of extended_dependency_graph is optionally reduced to the biggest (non degenerate) strongly connected component.
User gets the option of choosing the exploration depth.
Pietro replies to my email from June 11 and explains how the effect I see is due to "a nuisance of the debian to cudf encoding". As a result I change my code accordingly.
Another lengthy (1130 words) email to the list. I explain what was done in the past days, what parts work and how they work. I list some rationales on why I did things the way I did them.
The most important observation is that after improving my code again and again, I ended up representing the dependency cycle problem in the same (or a very similar) way that Pietro suggested in the beginning. This is probably a good discovery.
Much of the data in that email is of little use as of June 16, when I make lots of improvements in correctness.
As I don't have an answer to my other email to Pietro from June 13, I implement a very crude way to get an answer to the question of what packages are missing for a package to be available/compilable. I called it flatten_vpkgformula_best_effort and it suffers from many faults, including its handling of disjunctions and package conflicts.
Patrick spots a problem. As a result, I make sure that at no point, the source package of an arch:all package can be listed.
As a reply to my mail from June 13, Pietro creates a new branch in the git and adds the code I needed to get a proper installation set.
As a result of Pietro's efforts from June 15, I make great advancements on all fronts.
Details of the current status follow in the next section.
A big leap was made on June 16 due to Pietro's great help in making me understand how Depsolver.listcheck can be used for my purposes. My difficulties in finding the solution myself are rooted in many parts of the dose framework being poorly commented, but Pietro already made a couple of documentation commits whenever things were unclear to me.
Using Depsolver.listcheck makes it possible to be more distribution agnostic and I don't have to handle vpkgs, virtual packages and constraints myself anymore. The code also no longer suffers from wrongly analyzed dependencies and conflicts. The only thing that is not yet taken care of is that Depsolver.listcheck only chooses one out of several possible installation sets. A final version should be able to take into account that a different installation set could provide a better solution.
Overall, in comparison to two weeks ago, I can now properly build, traverse and analyze graphs, can choose an installation set properly, understand more about dependencies, closures, dose and ocaml in general.
Finding the importance of binary packages for building
When calculating how many source packages depend on the availability of a binary package, I originally flattened the pkg.Cudf.depends list twice for a rough overview. This is of course wrong due to disjunctions and conflicts and also doesn't provide a deep dependency inspection. The new method is to calculate, for every source package, an installation set that is necessary to compile it. The resulting lists of binary packages are then used to find out how often a binary package appears in an installation set.
I see three drawbacks though:
- calculating an installation set for each source package in the archive is very slow
- if X packages build depend on A then also X packages will build depend on the installation set of A, resulting in lots of duplication
- only one installation set is selected though there are many
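The counting step can be sketched as follows. This is a toy model, not the code from the repository: it assumes the installation sets have already been computed (in the post this is done through dose) and are given as plain lists of package names. The function name `count_occurrences` is hypothetical.

```ocaml
(* Hypothetical input: for each source package, the installation set
   (binary package names) chosen to satisfy its build dependencies. *)
let count_occurrences (install_sets : (string * string list) list) =
  let counts = Hashtbl.create 64 in
  List.iter
    (fun (_src, set) ->
      List.iter
        (fun bin ->
          let n = try Hashtbl.find counts bin with Not_found -> 0 in
          Hashtbl.replace counts bin (n + 1))
        (* count each binary package at most once per installation set *)
        (List.sort_uniq compare set))
    install_sets;
  (* Sort by descending count, then by name for a stable order. *)
  Hashtbl.fold (fun bin n acc -> (bin, n) :: acc) counts []
  |> List.sort (fun (a, x) (b, y) ->
         match compare y x with 0 -> compare a b | c -> c)

let () =
  count_occurrences
    [ ("s1", ["debhelper"; "gettext"]);
      ("s2", ["debhelper"; "cdbs"]);
      ("s3", ["debhelper"; "cdbs"]) ]
  |> List.iter (fun (bin, n) -> Printf.printf "%s: %d\n" bin n)
```

The second drawback from the list above is visible here: every package in the installation set of a popular build dependency inherits that dependency's count.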
Removing simple graph
The simple graph, which contained single binary and source packages, was removed. I realized it doesn't really serve any purpose to look at it. As a result, Bin vertices and InstallDep edges are also not part of the graph anymore. Since it was responsible for generating the list of source packages that have to be cross built to make a package available, I created a new function get_cross_source_packages which uses an installation set to serve the same purpose.
extended_dependency_graph now uses installation sets for generating the list of packages that is needed to compile a source package or install a binary package. The list of build dependencies does not include packages that are already installable. The list of runtime dependencies does not include packages that are otherwise available (cross built, arch:all...). Instead of checking for list membership all the time, I created hash tables for the list of installable as well as for the list of available binary packages.
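The switch from list membership to hash tables can be illustrated with the standard library alone. The names `table_of_list` and `missing_deps` are hypothetical and package names stand in for Cudf packages, so this is a sketch of the idea rather than the actual code:

```ocaml
(* Build a lookup table once; Hashtbl.mem is then constant time on
   average, whereas List.mem walks the whole list on every query. *)
let table_of_list (names : string list) : (string, unit) Hashtbl.t =
  let tbl = Hashtbl.create (List.length names) in
  List.iter (fun n -> Hashtbl.replace tbl n ()) names;
  tbl

(* Keep only the dependencies that are not already installable. *)
let missing_deps ~installable ~deps =
  List.filter (fun d -> not (Hashtbl.mem installable d)) deps

let () =
  let installable = table_of_list ["libc6"; "gcc"; "make"] in
  missing_deps ~installable ~deps:["gcc"; "debhelper"; "make"; "cdbs"]
  |> String.concat ", "
  |> print_endline
```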
There are two big tasks for the next two weeks:
Task one is to find a way to give hints on which packages to best consider for having reduced build dependencies. This would then probably finally make use of Pietro's cycle algorithms.
Task two is to find a way to break cycles and create a build-DAG from a list of packages that already have staged build dependency information.
Patrick is currently working on patching dpkg with Build-Depends-StageN dependencies as well as making perl cross compilable. If he doesn't need the ability to decide which packages to modify to have staged build dependencies in the near future, then task one is probably less urgent and therefore of less importance right now?
On the other hand, I can easily generate fake reduced build dependencies so that doing task two right now would not be a problem. Also, having the solution for task two will make it possible to show the user what effect it would have to add reduced build dependencies to a package.
For the reasons above (it's not urgent, task one profits from task two being solved) I will go and implement task two first (if there are no objections from my mentors).
Another idea, that I discussed with wookey and Patrick yesterday, was that due to multiarch being used for more and more packages, there should exist a set of packages that is cross compilable without any change to the package.
We agreed that I make a list of packages that, if cross compiled, would break dependency cycles and make other packages available. I created such a list of about 160 packages for Ubuntu Natty that, if cross compiled, made it possible to have 87% of Natty available (those numbers have to be treated with caution as I did not yet use the proper way of installation sets when generating that list, but the order of magnitude should be correct). Wookey can then try to cross compile those packages. If some packages of those "crucial" source packages are found to be cross compilable, then they should be cross compiled because it means that no work has to be done to break some cycles. Cross compiling all packages that are cross compilable out of the box is no solution, as only natively compiled packages can go into the archive. This is why the list of potentially additionally cross compiled source packages has to be kept as small as possible.
- Cloned Dose3 and made it build
- Retrieved bootstrap.ml and bootstrap2.ml from old revisions as they were deleted
- Compiled, tested and investigated the functionality of bootstrap.ml and bootstrap2.ml on a theoretical level as no test data was available
- Pietro sends me a tarball with his current version of bootstrap.ml and dummy as well as real test data
- Created a gitorious account, project and repository
- Compiled, tested and investigated his code
- Ran into several runtime problems with the supplied dummy examples
- Created Makefile to automatically fill ./examples/real/
- Found that .dot files are too big to be rendered
- Trying to figure out how hints work, how base-system was generated and why execution takes hours
- Pietro made examples work which let me understand the code much more
- Improvement of .dot output and output formatting
- Refactored code into bootstrapCommon.ml for shared functionality and bootstrap.ml for option parsing and main()
- Play with xdeb.py
- Generate dot graphs with bootstrap.ml and analyze them with sccmap
- Try to find a way to have a reduced package selection other than main archives of ubuntu/debian
- Initial work on trying to find the list of minimal source packages that have to be cross compiled
- Create email@example.com mailinglist
- Implement a replacement for apt-rdepends and grep-dctrl functionality in ocaml, both working on Package files
- Retrieve list of packages with priority:required
- Retrieve their runtime dependencies
- Retrieve the packages that are added with build-essential and dependencies
- Retrieve the list of source packages that are needed to build the above
- Retrieve list of binary packages that are built from the source packages in addition
- some more functionality in the Makefile
- Depsolver.dependency_closure replaces homebrew functionality in a better and faster way
- Only consider those binary packages that can actually be installed, given the limited amount of available packages using Depsolver.edos_install
- Create proper list diff by correctly comparing Cudf.package members
- Big code restructuring
- consider arch:all packages to be available by default
- Got helpful sourcecode comments by Pietro
- Use Depsolver.trim to reduce a universe to the installable packages
- Compile with dose 2.9.17
- Basebuildsystem now also writes output to min-cross-sources.list and base-system.list
- Begin work on basenocycles.ml to see how much the minimal system can build without cycle breaking
- Use Depsolver.trim to find source packages that can be built given the restricted universe
- Find the final list of packages that are available without solving staged build dependencies for Natty
- Many code simplifications
I learned a good chunk of ocaml and how to use dose3 and libcudf.
git clone git://gitorious.org/debian-bootstrap/botch.git
The git as of now contains 30 commits and 1197 lines of ocaml code.
So far, 62 emails have been exchanged between me and Pietro and Wookey.
I created a mailinglist for this project where all email exchange so far is publicly accessible in the archives. You can also download all of the email exchange in mbox format. Everybody is welcome to join and/or read the list.
What seems to be finished: the program that finds the minimal amount of source packages that have to be cross compiled to end up with a minimal build system. What it does is:
1. get all essential packages
2. get their runtime dependencies
3. get build-essential plus runtime dependencies
4. get all source packages that are necessary to build 1.-3.; those are the packages that have to be cross compiled
5. get a list of all packages that are built by source packages from 4.
6. add all packages from 1., 2., 3. and 5. plus all arch:all packages to a universe
7. use Depsolver.trim on that universe to figure out which of those packages are actually installable
The result of 7. will then contain a list of packages that are available automatically on the foreign system due to cross compiled source packages and arch:all packages.
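The first five steps above can be sketched on a toy package model. Everything here is an assumption made for illustration: the record type `bin`, the function names, and the treatment of dependencies as plain name lists without versions, disjunctions or conflicts. The installability check of steps 6 and 7 needs a real solver (Depsolver.trim in my program) and is left out.

```ocaml
module S = Set.Make (String)

(* Hypothetical, heavily simplified binary package record. *)
type bin = { name : string; essential : bool; deps : string list; source : string }

(* Runtime dependency closure of a set of binary package names. *)
let closure (by_name : (string, bin) Hashtbl.t) (roots : S.t) : S.t =
  let rec visit acc name =
    if S.mem name acc then acc
    else
      match Hashtbl.find_opt by_name name with
      | None -> acc (* unknown package: ignored in this toy model *)
      | Some p -> List.fold_left visit (S.add name acc) p.deps
  in
  S.fold (fun n acc -> visit acc n) roots S.empty

let minimal_cross_sources (bins : bin list) =
  let by_name = Hashtbl.create 16 in
  List.iter (fun p -> Hashtbl.replace by_name p.name p) bins;
  (* steps 1-3: essential packages, build-essential, and their dependencies *)
  let roots =
    List.fold_left
      (fun acc p -> if p.essential then S.add p.name acc else acc)
      (S.singleton "build-essential") bins
  in
  let base = closure by_name roots in
  (* step 4: the source packages that have to be cross compiled *)
  let sources =
    S.fold (fun n acc -> S.add (Hashtbl.find by_name n).source acc) base S.empty
  in
  (* step 5: additional binary packages built from those sources *)
  let extra =
    List.fold_left
      (fun acc p ->
        if S.mem p.source sources && not (S.mem p.name base)
        then S.add p.name acc else acc)
      S.empty bins
  in
  (base, sources, extra)
```

Applied to a handful of made-up packages, `sources` is the list that would end up in min-cross-sources.list, and `base` together with `extra` is the input to the installability check.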
For Debian Sid, the output of my program is:
# (1) number of packages with priority:required: 62
# (2) plus, number of dependencies of priority:required packages: 20
# (3) plus, build-essential and dependencies: 31
# number of source packages to build the above: 71
# number of additional packages built from the above source packages: 292
# (4) number of packages of those plus arch:all packages that are installable: 6421
# total number of installable packages (1)+(2)+(3)+(4): 6534
For Ubuntu Natty it is:
# (1) number of packages with priority:required: 96
# (2) plus, number of dependencies of priority:required packages: 7
# (3) plus, build-essential and dependencies: 31
# number of source packages to build the above: 87
# number of additional packages built from the above source packages: 217
# (4) number of packages of those plus arch:all packages that are installable: 2102
# total number of installable packages (1)+(2)+(3)+(4): 2236
So for Debian, 71 source packages definitely have to be made cross compilable while for Natty, the number is 87.
The last two days I was toying around with these minimal systems to see how many source packages can be built on top of them without running into dependency cycles. After installing the binary packages that were built, I checked again until no new packages could be built.
For Natty, I was only able to find 28 additional packages that can be built on top of the 2236 existing ones. This means that a number of dependency cycles prevent building anything else.
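The round-based procedure is a fixpoint computation. The sketch below shows the idea on a toy model; the record type `source` and the function `build_rounds` are hypothetical, and build dependencies are plain name lists, whereas the real program asks dose whether a source package is buildable.

```ocaml
module S = Set.Make (String)

(* Hypothetical toy model: a source package lists the binary packages it
   needs to build and the binary packages it produces. *)
type source = { name : string; build_deps : string list; builds : string list }

(* Repeat rounds until a round builds nothing new. Returns the available
   binaries, the built source names grouped by round, and the sources
   that stay blocked (for example because they sit in a cycle). *)
let build_rounds (available : S.t) (sources : source list) =
  let rec loop available pending rounds =
    let buildable, blocked =
      List.partition
        (fun s -> List.for_all (fun d -> S.mem d available) s.build_deps)
        pending
    in
    match buildable with
    | [] -> (available, List.rev rounds, blocked)
    | _ ->
      let available =
        List.fold_left
          (fun acc s -> List.fold_left (fun acc b -> S.add b acc) acc s.builds)
          available buildable
      in
      loop available blocked (List.map (fun s -> s.name) buildable :: rounds)
  in
  loop available sources []

let () =
  let sources = [
    { name = "a"; build_deps = ["gcc"]; builds = ["liba"] };
    { name = "b"; build_deps = ["liba"]; builds = ["libb"] };
    { name = "x"; build_deps = ["y-bin"]; builds = ["x-bin"] };
    { name = "y"; build_deps = ["x-bin"]; builds = ["y-bin"] };
  ] in
  let _, rounds, blocked = build_rounds (S.of_list ["gcc"]) sources in
  List.iteri
    (fun i r -> Printf.printf "round %d: %s\n" (i + 1) (String.concat " " r))
    rounds;
  Printf.printf "blocked: %s\n"
    (String.concat " " (List.map (fun s -> s.name) blocked))
```

In this example `x` and `y` need each other and never become buildable, which is the same effect that stops the Natty run after 28 packages.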
In the coming two weeks I will focus on coming up with a tool that cleverly helps the user to identify packages that would be useful to have for building more packages (probably determined by how many packages depend on it - debhelper is an obvious candidate). The tool would then show why that crucial package is not available (in case of debhelper because some of its runtime dependencies are not available and require debhelper to be built) and how the situation can best be resolved. The possible methods to do so are to identify a package that is part of a cycle and either cross compile it or let it have staged build dependencies.
When packaging software for Debian, there exist two important assumptions:
- Compilation is done natively
- Potentially all of Debian is available at compile time
Both assumptions make the life of a package maintainer much easier and they do not create any problem unless you are one of the unlucky few who want to run Debian on an architecture that it does not yet exist for.
You will then have to either cross compile a set of base packages (which is hard because packages are built and tested to build natively, not cross - perl is a big blocker of building the minimal set of packages cross but through multiarch other packages become easier to cross build) or use other distributions like OpenEmbedded or Gentoo which you compiled (or retrieved otherwise) for that new architecture to hack a core of Debian source packages until they build a minimal Debian system that you can chroot into and continue natively building the rest of it. But even if you manage to get that far you will continue to be plagued by cyclic build and runtime dependencies. So you start to hack source packages so that they drop some dependencies and you can break enough cycles to advance step by step.
The Debian ports page lists 24 ports of Debian, so despite its unpleasant nature, porting it is not a rare undertaking.
The process as laid out above has a number of drawbacks:
- The process is mostly manual and reinvented every time it is done.
- If you can't cross compile something, then you need another distribution for the bootstrapping process. Debian itself should be sufficient.
- Its complexity and manual nature prevents architectures with little workforce behind them from catching up to the main archive.
- It also prevents Debian from being offered in CPU optimized sub-arch builds.
If Debian would provide a set of core packages that are cross-compilable and which suffice for a minimal foreign build system, and if it would also have enough source packages that provide a reduced build dependency set so that all dependency cycles can be broken, building Debian for a yet unknown architecture could be mostly automated.
The benefits would be:
- Putting Debian on a foreign architecture would (in the best case) boil down to making the code cross-compile for and native-compile on that architecture.
- Debian would not need any other distribution to be ported to a different architecture. This would make Debian even more "universal".
- Lagging architectures can be more easily updated or rebooted than when they were initially created.
- Debian optimized for specific CPUs (Raspberry Pi, OpenMoko...) would be more attractive.
With three of this year's GSoC projects, this dream seems to come into reach.
There is the "Multiarch Cross-Toolchains" project by Thibaut Girka and mentored by Hector Oron and Marcin Juszkiewicz. Cross-compiling toolchains need packages from the foreign architecture to be installed alongside the native libraries. Cross-compiler packages have been available through the emdebian repositories but always were more of a hack. With multiarch, it is now possible to install packages from multiple architectures at once, so that cross-compilation toolchains can be realized in a proper manner and therefore can also enter the main archives. Besides creating multiarch enabled toolchains, he will also be responsible for making them build on the Debian build system as cross-architecture dependencies are not yet supported.
There is also the "Bootstrappable Debian" project by Patrick "P. J." McDermott and mentored by Wookey and Jonathan Austin. He will make a small set of source packages multiarch cross-compilable (using cross-compilers provided by Thibaut Girka) and add a Build-Depends-StageN header to critical packages so that they can be built with reduced build dependencies for breaking dependency cycles. He will also patch tools as necessary to recognize the new control header.
And then there is my project: "Port bootstrap build-ordering tool" (Application). It is mentored by Wookey and Pietro Abate. In contrast to the other two, my output will be more on the meta-level as I will not modify any actual Debian package or patch Debian tools with more functionality. Instead the goal of this project is threefold:
- find the minimal set of source packages that have to be cross compiled
- help the user to find packages that are good candidates for breaking build dependency cycles through added staged build dependencies or by making them cross-compilable
- develop a tool that takes the information about packages that can be cross compiled or have staged build dependencies to output an ordering with which packages must be built to go from nothing to a full archive
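To give an idea of what the third goal means once all cycles are broken: a build order is a topological ordering of the dependency graph. The following is only a sketch of that textbook technique on a hypothetical input (source package names with the sources they need built first), not the tool itself. The names `build_order` and `Cycle` are made up for this example.

```ocaml
type mark = Visiting | Done

(* Raised with the name of a package that was reached again while it
   was still being visited, i.e. a package that is part of a cycle. *)
exception Cycle of string

(* Depth-first topological sort: a package is emitted only after
   everything it depends on has been emitted. *)
let build_order (deps : (string * string list) list) : string list =
  let state = Hashtbl.create 16 in
  let order = ref [] in
  let rec visit name =
    match Hashtbl.find_opt state name with
    | Some Done -> ()
    | Some Visiting -> raise (Cycle name)
    | None ->
      Hashtbl.replace state name Visiting;
      List.iter visit (try List.assoc name deps with Not_found -> []);
      Hashtbl.replace state name Done;
      order := name :: !order
  in
  List.iter (fun (name, _) -> visit name) deps;
  List.rev !order

let () =
  build_order [ ("c", ["b"]); ("b", ["a"]); ("a", []) ]
  |> String.concat " -> "
  |> print_endline
```

If the input still contains a cycle, `build_order` raises `Cycle`, which is the situation that staged build dependencies or cross compilation are meant to resolve.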
More on that project in my follow-up post.