When I wanted to use my dependency graph analysis tools to analyze earlier states of Debian Sid, I naturally used snapshot.debian.org to retrieve the Packages and Sources files from which my tools retrieve the dependency information.
The problem is, that many of those Packages and Sources files contain syntax errors that make the dose3 parser choke. This leads to my tools being unable to parse the affected files.
The following script does not only download all Packages and Sources files in a five day interval (4460 MB from 2005/03/12 to 2012/10/11) but also cleans all the syntax errors that were not parsable by dose3. This includes invalid version naming, architecture lists separated by commas, disjunctions in Conflict fields and incorrect braces/bracket usage.
Maybe this helps others who also want to profit from Packages and Sources files from the past.
Fun fact #1: starting from June 2010, there were no more syntax errors in the Packages and Sources files of Debian Sid.
Fun fact #2: starting from December 2009, there are no more mismatches between versions of binary packages in the Packages file and the versions of the corresponding source packages in the Sources file.