announcing img2pdf

categories: code

tldr; lossless conversion of JPEG and JPEG2000 files to PDF without significant increase of filesize: https://github.com/josch/img2pdf

Since she knew I was able to do all sorts of fancy stuff with pdf on my computer (inkscape, pdftk, pdflatex, cairo, ghostscript and others), a friend of mine asked me to convert a JPEG of a scanned document into a PDF for her. From a homework assignment that I solved as an undergrad I had a fairly good understanding of the PDF file format and I knew that PDF can simply embed JPEG data for images. So I thought it would be easy enough to just wrap her JPEG with the PDF file structure and be done with it.

Surprisingly it turned out that no tool I knew of (or was able to find on the internets) was able to do exactly that. Some tools were able to convert my JPEG to a PDF of equal size but they were re-encoding the JPEG and hence led to quality loss. Others did a lossless conversion but achieved it by compressing an RGB representation of the image using zip/flate encoding (the right way to store images losslessly in pdf) which increased the filesize manyfold. I knew that it was technically possible to inject her JPEG into a PDF without any byte of the JPEG changing, so I refused to either lose quality by re-encoding her JPEG or to blow up the 2.8MB JPEG into a 14MB PDF file. On top of that, imagemagick would take a whopping 27 seconds to convert just a single JPEG to PDF (using lossless zip compression). This is completely unacceptable for bigger conversion tasks.

Searching on the internet revealed other people having the same problem, and some claimed to sit on huge JPEG2000 scan archives they wanted to convert to PDF losslessly, so I decided to support JPEG2000 as well.

Out came a tool that does exactly what I wanted. It takes image filenames as commandline arguments. If the image is a JPEG or JPEG2000 file, it will dump its content into the pdf structure as it is without changing a single byte. If any other format is found, the normal zip/flate encoding will be used. Thus, in both cases, the encoding is lossless, while for JPEG and JPEG2000 images the result is also much smaller than it would have been when forcing a re-encoding as zip/flate. Giving multiple images on the commandline will produce a multipage pdf.
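The decision described above can be sketched in a few lines of Python. This is only an illustration and not the actual img2pdf code: the function name pdf_image_stream is made up for this post, and the sketch assumes that a JPEG starts with the bytes FF D8 and that a JPEG2000 file comes in the JP2 container with its 12 byte signature. DCTDecode, JPXDecode and FlateDecode are the names PDF uses for the respective stream filters.

```python
import zlib

# Magic bytes (assumption: well-formed files; raw JPEG2000 codestreams
# without the JP2 container are not handled in this sketch)
JPEG_MAGIC = b"\xff\xd8"
JP2_MAGIC = b"\x00\x00\x00\x0cjP  \r\n\x87\n"


def pdf_image_stream(data, raw_pixels=None):
    """Return (pdf_filter_name, stream_bytes) for an image.

    data       -- the image file content as read from disk
    raw_pixels -- decoded pixel data, only needed for non-JPEG input
    """
    if data.startswith(JPEG_MAGIC):
        # JPEG: embed the file byte for byte
        return "/DCTDecode", data
    if data.startswith(JP2_MAGIC):
        # JPEG2000: embed the file byte for byte
        return "/JPXDecode", data
    if raw_pixels is None:
        raise ValueError("non-JPEG input needs decoded pixel data")
    # anything else: lossless zip/flate compression of the pixel data
    return "/FlateDecode", zlib.compress(raw_pixels)
```

Decoding the other formats into raw pixels is left out here on purpose, it would need an imaging library.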

More information is available in the README.md

git clone git://github.com/josch/img2pdf.git

Patches are welcome as usual.

Also tell me if I should've missed another application that can do the same.


announcing PyFeRea

categories: code

tldr; An RSS reader without Gnome/KDE dependencies using Python, Gtk, Webkit: https://github.com/josch/pyferea

About a year ago I wrote the following email to the debian-user list:

Subject: is there no sane, minimal, graphical RSS feed reader in existance?

Hi,

I've been looking for a good RSS feed reader for years now but I still seem not
to be able to find a sane, minimal graphical RSS reader.

What I'm using now is liferea which is okay but could be more minimal and
mainly, is way too slow to enjoy using it (search for the fsync issue).

So what is left?

- There is a bunch of web based readers but I treasure having my stuff offline
  as well.

- There are readers for the gnome or KDE environment. Since I use neither it
  would mean to get 100s of MB (literally) of dependencies

- There are firefox extensions but why would I have to install a web browser to
  read my RSS feeds?

- There are thunderbird, evolution and opera but same argument: why would I
  want to install an email-client/browser for my feeds?

- There are clients like blam that are written in .NET/mono and would also
  require dozens of dependencies (not talking about .NET evilness)

- There are readers for the terminal but I have several feeds with images and I
  dont want to open another window of my browser each time.

I can't imagine there are no others who do not use Gnome/KDE (having a more
minimal setup) but would want to have a graphical RSS reader?

What I'm looking for is not much: it would just depend on either
gtk/qt/efl/whatever for its UI, would have one list of the feeds, another list
for recent feed items and another frame with a gecko or webkit plugin for
presenting the item. Why this feature/dependency bloat everywhere?

Why is there no simple reader with minimal dependencies? Am I just overlooking
one? Are my requirements too weird? I'm not afraid to compile from source
either, should it not be in Debian. Should I like it I would also package it
for Debian.

As I said, liferea is close (just had to bear with the gconf2 dependency) but
slow as hell (and no, I refuse to use the "fsync workaround").

Are there others that share my need? If there is really no such thing as a real
minimal graphical RSS reader, I'm close to writing one myself.

Since I'm not subscribed, please dont forget to CC me. Thanks!

cheers, josch

I pasted this email because it best describes my issue and I would've just repeated its content anyways.

Well, apparently I was close enough to writing one myself that I just did it. I am hereby announcing PyFeRea, a minimal RSS reader without Gnome or KDE dependencies, written in Python, using Gtk and WebKit and coming without the LiFeRea slowness.

I suck at naming things and hence I am still left with my initial 'codename' for it: PyFeRea, as I wanted something that looked just like LiFeRea but would not suffer from its slowness.

More can be found in the project's README.md

git clone git://github.com/josch/pyferea.git

Patches are welcome as usual.

Also tell me if I should've missed an RSS reader with my requirements in mind.


transposing csv for gnuplot

categories: oneliner

I recently got a csv that was exported from openoffice spreadsheet with data arranged in rows and not columns as gnuplot likes it. It seems that gnuplot (intentionally) lacks the ability to parse data in rows instead of columns. Hence I had to switch rows and columns (transpose) my input csv such that gnuplot likes it.

Transposing whitespace delimited text can be done with awk but csv is a bit more complex as it allows quotes and escapes. So a solution had to be found which understood how to read csv.

This turned out to be so simple and minimalistic that I had to post the resulting oneliner that did the job for me:

python -c'import csv,sys;csv.writer(sys.stdout,csv.excel_tab).writerows(map(None,*list(csv.reader(sys.stdin))))'

This will read input csv from stdin and output the transpose to stdout. The transpose is done by using:

map(None, *thelist)

Another way to do the transpose in python is by using:

zip(*thelist)

But this solution doesn't handle rows of different length well because zip stops at the shortest row.
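Note that map(None, ...) only works in Python 2. A minimal sketch of the difference between the two approaches for Python 3, assuming itertools.zip_longest as the replacement for map(None, ...) (the helper name transpose is made up for illustration):

```python
from itertools import zip_longest


def transpose(rows, fill=None):
    """Transpose a list of rows, padding short rows with `fill`."""
    return [list(col) for col in zip_longest(*rows, fillvalue=fill)]


rows = [["a", "b", "c"], ["1", "2"]]

# zip stops at the shortest row, so the "c" column is silently dropped
print([list(col) for col in zip(*rows)])
# zip_longest keeps it and pads the missing cell with the fill value
print(transpose(rows))
```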

In addition, by using the excel_tab dialect in the csv.writer, the solution above will output the csv tab delimited instead of comma delimited, which is how gnuplot likes it.

The solution above is problematic when some of the input values in between are empty. This is not because the csv would be transposed incorrectly but because gnuplot collapses consecutive whitespace characters into one. There are several solutions to that problem. Either, instead of an empty cell, insert "-" in the output:

python -c'import csv,sys; csv.writer(sys.stdout, csv.excel_tab).writerows(map(lambda *x:map(lambda x:x or "-",x),*list(csv.reader(sys.stdin))))'
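For those who prefer something more readable than a oneliner, here is a rough Python 3 sketch of the same idea. The function name transpose_csv and its placeholder parameter are made up for illustration, and zip_longest is used instead of the Python 2 only map(None, ...):

```python
import csv
import io
from itertools import zip_longest


def transpose_csv(infile, outfile, placeholder="-"):
    """Read csv from infile, write the tab delimited transpose to outfile."""
    rows = list(csv.reader(infile))
    writer = csv.writer(outfile, csv.excel_tab)
    for column in zip_longest(*rows, fillvalue=""):
        # replace empty cells so that gnuplot does not collapse them
        writer.writerow([cell or placeholder for cell in column])


# small self-contained demonstration with an empty cell in the middle
out = io.StringIO()
transpose_csv(io.StringIO("x,1,2\ny,,4\n"), out)
print(out.getvalue())
```

To use it as a filter like the oneliners, call transpose_csv(sys.stdin, sys.stdout) instead of the demonstration at the end.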

Or output a comma delimited csv and tell gnuplot that the input is comma delimited:

python -c'import csv,sys;csv.writer(sys.stdout).writerows(map(None,*list(csv.reader(sys.stdin))))'

And then in gnuplot:

set datafile separator ","

xen hypervisor on qemu kvm and domu nfs boot with vde

categories: debian

Let me share how to set up xen inside qemu with kvm support, with domus booting over nfs from the qemu host, and how to connect multiple of those instances together using vde networking. The debian installer, the debian inside qemu, the domus and debootstrap will use my local apt-cacher setup at port 3142.

This setup is based on Debian wheezy (testing at this point). To install testing, grab the latest debian installer business card image from here:

wget http://cdimage.debian.org/cdimage/daily-builds/daily/arch-latest/amd64/iso-cd/debian-testing-amd64-businesscard.iso

Then create a disk image for qemu to use as its harddisk. The following creates a sparse 3000 MiB file with dd:

dd if=/dev/zero of=disk.img bs=1 count=1 seek=3000MiB
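The dd call seeks to the given offset and writes one single byte there, so nothing before that byte is ever allocated on disk. A small sketch of the same trick in Python, in case dd is not at hand (the function name create_sparse_file is made up, and whether the file really ends up sparse depends on the filesystem):

```python
import os
import tempfile


def create_sparse_file(path, size):
    """Create a file with a single zero byte written at offset `size`.

    Mirrors dd with bs=1 count=1 seek=size: the result is size + 1 bytes
    long, but on filesystems with sparse file support only the last block
    is actually allocated.
    """
    with open(path, "wb") as f:
        f.seek(size)
        f.write(b"\0")


# demonstration with a small size in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "disk.img")
    create_sparse_file(path, 3000 * 1024)  # 3000 KiB instead of 3000 MiB
    print(os.path.getsize(path))
```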

Using the debian installer to set up the system is preferable to creating a rootfs with debootstrap as the xen hypervisor wants to be booted by grub. Qemu does not yet support booting the xen hypervisor directly in the way it can boot a linux kernel with the -kernel option. Grub installation and partitioning is most easily done by just using d-i.

To automate the installation I'm using the following preseed file:

d-i debian-installer/locale string en_US
d-i console-keymaps-at/keymap select us
d-i keyboard-configuration/xkb-keymap select us
d-i netcfg/choose_interface select auto
d-i netcfg/get_hostname string debian
d-i netcfg/get_domain string 
d-i mirror/country string manual
d-i mirror/http/hostname string 10.0.2.2:3142
d-i mirror/http/directory string /ftp.de.debian.org/debian
d-i mirror/suite string wheezy
d-i mirror/udeb/suite string wheezy
d-i passwd/root-login boolean true
d-i passwd/make-user boolean false
d-i passwd/root-password password root
d-i passwd/root-password-again password root
d-i clock-setup/utc boolean true
d-i time/zone string UTC
d-i clock-setup/ntp boolean true
d-i partman-auto/method string regular
d-i partman-auto/choose_recipe select atomic
d-i partman-partitioning/confirm_write_new_label boolean true
d-i partman/choose_partition select finish
d-i partman/confirm boolean true
d-i partman/confirm_nooverwrite boolean true
d-i base-installer/install-recommends boolean false
d-i base-installer/kernel/image select none
tasksel tasksel/first multiselect 
d-i pkgsel/include string xen-linux-system-amd64 xen-tools xen-utils
d-i finish-install/reboot_in_progress note

It answers all questions for debconf so that no user input is needed. It will tell d-i to use the apt-cacher setup on the qemu host (10.0.2.2 is the default ip address of the host from inside qemu when using usermode networking), will install wheezy, will set the root password to "root", will only create a single partition for / on the virtual harddrive, will not install recommends and not install the "standard" tasksel target but will install the xen hypervisor and some xen tools. I uploaded the file to http://mister-muffin.de/debian/preseed3.txt

To make qemu use virtualization features of the host cpu, one only needs to install the qemu-kvm package.

apt-get install qemu-kvm

From that point on, qemu will automagically use kvm. The speedup gained is tremendous. Instead of taking 13 minutes to boot my xen dom0 (ridiculous) the machine would boot up in only 3 minutes (workable).

Now start qemu, giving it disk.img as the harddisk and debian-testing-amd64-businesscard.iso as the boot medium in the cd drive.

qemu-system-x86_64 -m 1024 -hda disk.img -cdrom debian-testing-amd64-businesscard.iso

The isolinux boot menu will pop up. Choose "Advanced options" and then select "Automated install" and press [TAB] to edit the boot commandline. Append the preseed url for debconf like this to the end:

preseed/url=http://mister-muffin.de/debian/preseed3.txt

After hitting enter the system will install by itself. After it is done it will automatically reboot into debian. The hypervisor will not boot by default (bug#603832) so change the grub priority from inside the virtual machine by doing:

mv /etc/grub.d/10_linux /etc/grub.d/21_linux
update-grub

You also do not want to continue using qemu in graphical mode but want to connect to the virtual machine via serial. To do so, let a tty spawn on the serial line in inittab:

echo "T0:23:respawn:/sbin/getty -L ttyS0 9600 vt100" >> /etc/inittab

In comparison to the graphical SDL display, the advantages are easy copy and paste, using the same keyboard layout as the host, screensaver is not deactivated, proper console font, terminal, window manager integration, no grabbing and ungrabbing and working system bell.

To also get qemu, kernel and init output on serial add the following options to /etc/default/grub:

GRUB_CMDLINE_LINUX="console=ttyS0"
GRUB_TERMINAL=serial

And run update-grub again.

Now qemu can be started like this:

qemu-system-x86_64 -m 1024 -hda disk.img -nographic

It will automatically boot the xen hypervisor and attach a tty to serial when the boot is finished.

Once it is, configure xen. Activate bridged networking by uncommenting the following line in the xen config:

(network-script network-bridge)

For bridging to work, xen will need the brctl utility of the bridge-utils package (bug#648816).

apt-get install bridge-utils

Since the debian mirror is still the same as during installation, the apt-cacher setup from back then will still be used which makes installation of additional packages extremely fast.

There are then several ways to create a new domu. The easiest one is to just call:

xen-create-image --hostname=vm01 --dir=/root --dhcp --noswap --size=400Mb

This command will first run debootstrap and then configure the result of it. Since the debian mirror of the host system is chosen as the default, debootstrap will run reasonably fast.

A faster way is to have a tarball ready that contains the result of a debootstrap run and then to call xen-create-image with the --install-method=tar option.

So either inside or outside qemu (outside is naturally faster) run debootstrap like this:

debootstrap --variant=minbase wheezy target-directory http://127.0.0.1:3142/ftp.de.debian.org/debian

Tar it, put it inside the virtual machine and then inside qemu:

xen-create-image --hostname=vm01 --dir=/root --dhcp --noswap --size=400Mb --install-method=tar --install-source=/root/vm01.tar

This command will unpack the tarball into a disk image and then configure it.

Instead of running xen-create-image inside qemu, you can also run it on the qemu host which will be faster. But if you do not want to nfs boot and instead boot from the disk image it creates, don't forget to copy the xen domu config it creates into the virtual machine.

Instead of xen-create-image you can also do all steps manually. So first run debootstrap as usual and then do some basic configuration. The most important part would be the activation of the xen tty in inittab. xen-create-image will call a number of hook scripts which do this configuration. Those hooks can also be run manually on a manually created debootstrap root directory like this:

export verbose=true
export hostname=foobar
export dhcp=true
export mirror=http://127.0.0.1:3142/ftp.de.debian.org/debian
export dist=wheezy
for script in `find /usr/lib/xen-tools/debian.d -type f ! -name '90-make-fstab' | sort -n`; do
        $script /home/josch/debian-wheezy
done

Editing fstab is not strictly needed and only hurts when using nfs boot. Don't execute the fstab hook when using nfs and generally have a look at each of the hooks to find out what they do. It is safer to run xen-create-image but to understand what it does, look into the hooks in /usr/lib/xen-tools/debian.d.

Also as a note, you can always mount the qemu image using:

mount -o loop,offset=1048576 disk.img /mnt

The offset of the first partition can be found out using fdisk on disk.img.
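The offset is just the start sector of the partition multiplied by the sector size, so 1048576 corresponds to a start sector of 2048 with 512 byte sectors. As a sketch, the value can also be read directly from the partition table. This assumes a classic MBR (not GPT) and 512 byte sectors, and the function name first_partition_offset is made up for illustration:

```python
import struct

SECTOR_SIZE = 512  # assumption: 512 byte logical sectors


def first_partition_offset(mbr):
    """Return the byte offset of the first partition from an MBR sector."""
    if len(mbr) < 512 or mbr[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR")
    # the partition table starts at byte 446, each entry is 16 bytes long;
    # the start sector (LBA) is a little-endian 32 bit integer at byte 8
    # of the entry
    (start_lba,) = struct.unpack_from("<I", mbr, 446 + 8)
    return start_lba * SECTOR_SIZE
```

Feed it the first 512 bytes of disk.img (opened in binary mode) and pass the result as the offset option to mount.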

In our setup we want to boot the domu from a root directory which is served by the host of the virtual machine. Doing so will just require a proper xen configuration and no disk space on the hypervisor side is used.

Either create a configuration from scratch or use:

xen-create-nfs --hostname=vm01 --dhcp --nfs_server=10.0.2.2 --nfs_root=/srv/nfs/vm01 --memory=128

And then edit the result so that it looks like this:

    kernel     = '/boot/vmlinuz-3.0.0-1-amd64'
    ramdisk    = '/boot/initrd.img-3.0.0-1-amd64'
    vcpus      = '1'
    memory     = '128'
    name       = 'vm01'
    hostname   = 'vm01'
    dhcp       = 'dhcp'
    vif        = [ '' ]
    nfs_server = '10.0.2.2'
    nfs_root   = '/srv/nfs/vm01'
    root       = '/dev/nfs'
    extra      = 'boot=nfs root=/dev/nfs'

You should only have to add the 'extra' option as this is important for the initrd to boot from nfs.

On your host do:

apt-get install nfs-kernel-server

And then add a directory serving a rootfs in /etc/exports:

/srv/nfs 127.0.0.1(rw,sync,no_subtree_check,no_root_squash,insecure)

These options will only allow localhost to access it. The insecure option is necessary because the requests will arrive from an unprivileged source port (above 1024), which the nfs server refuses by default.

The rootfs can be created using debootstrap and then running the hooks as explained above, or by taking a filesystem image that was created by xen-create-image and extracting its contents to the directory exported as an nfs share as this image will already contain the modifications needed for a proper boot.

In qemu you can now start the vm and connect to it using:

xm create /etc/xen/vm01.cfg -c

One disconnects from it using the ctrl+] escape as in telnet. To reconnect, use:

xm console vm01

To now connect multiple qemu instances, each running a hypervisor, so that they can each access the internet and talk to each other, the most convenient setup is using VDE networking. It is a network bridge implemented in userspace (no superuser privileges required) which connects the machines together (bridging them) using socket communication. Together with the slirp module this bridge is connected to the outer world and slirp can even provide dhcp to the qemu instances connected to it.

To start vde:

vde_switch

The -daemon switch can be used to send the process into the background and the -sock switch can be used to supply a socket different from the default /tmp/vde.ctl.

To start slirp:

slirpvde -dhcp

The --daemon option will send it to the background as well while the --sock option allows supplying a custom control socket.

After the bridge is set up and slirp is connected to it, start qemu like this:

qemu-system-x86_64 -m 1024 -hda disk.img -nographic -net nic,macaddr=XX:XX:XX:XX:XX:XX -net vde,sock=/tmp/vde.ctl

Where XX:XX:XX:XX:XX:XX is a unique mac address that can be generated by doing:

printf 'DE:AD:BE:EF:%02X:%02X\n' $((RANDOM%256)) $((RANDOM%256))
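When starting several instances from a script, the same can be done in Python. This is just a sketch and the function name random_mac is made up. The DE:AD:BE:EF prefix is kept from the shell version, and its first octet 0xDE happens to mark the address as locally administered and unicast. With only two random octets, collisions are possible, so check for duplicates when running many instances.

```python
import random


def random_mac(prefix="DE:AD:BE:EF"):
    """Return the prefix followed by two random octets."""
    return "%s:%02X:%02X" % (prefix, random.randrange(256), random.randrange(256))


print(random_mac())
```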

The ip addresses supplied by the slirp dhcp will be the same as with qemu usermode networking so the setup from above doesn't change.


adblocking with a hosts file

categories: blog

Naturally adblock plus is a must-have extension for firefox but other programs displaying websites might not offer such a facility.

To block ads on any application accessing the internet, the use of a hosts file which redirects requests to certain hostnames to 127.0.0.1 (which will refuse incoming connections) provides a universal method to get rid of advertisements.

The question is how to obtain a list of malicious hosts.

Searching around revealed three lists that seemed to be well-maintained:

  • http://winhelp2002.mvps.org/hosts.htm
  • http://pgl.yoyo.org/adservers/
  • http://someonewhocares.org/hosts/

The corresponding hosts-file entries can be found under these urls respectively:

  • http://winhelp2002.mvps.org/hosts.txt
  • http://pgl.yoyo.org/adservers/serverlist.php?hostformat=hosts&mimetype=plaintext
  • http://someonewhocares.org/hosts/hosts

I also looked into the adblock plus filter rules but they mostly contain expressions for the path, query and fragment part of URIs and not so much hostnames. This makes sense because using its syntax adblock plus is able to block with much more accuracy than just blocking whole domains.

Now I wanted a combined list of them without duplicates so I cleaned them up using the following sed expression:

sed 's/\([^#]*\)#.*/\1/;s/[ \t]*$//;s/^[ \t]*//;s/[ \t]\+/ /g'

It removes comments, whitespace at the beginning and end of the line and reduces any additional whitespace (between ip and hostname) to only one space. I would then run the output through sort and uniq and append the result to my /etc/hosts.
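The whole cleanup including sort and uniq can also be sketched as one small Python function. This is not what I actually ran, just an equivalent illustration with a made up function name clean_hosts. Unlike the sed expression it also drops empty lines and strips carriage returns at the end of a line.

```python
import re


def clean_hosts(lines):
    """Strip comments and whitespace, then return sorted unique entries."""
    entries = set()
    for line in lines:
        line = line.split("#", 1)[0]          # remove comments
        line = re.sub(r"[ \t]+", " ", line)   # collapse whitespace
        line = line.strip()                   # trim both ends
        if line:
            entries.add(line)
    return sorted(entries)
```

It takes any iterable of lines, for example the concatenation of the three downloaded lists opened as text files.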

What is still problematic about this approach is that if one doesn't have a service bound to 127.0.0.1:80, every application trying to establish a TCP connection to it will pointlessly wait for localhost to respond until the timeout is reached. To avoid this and immediately send a tcp RST when the browser is redirected to 127.0.0.1 while trying to retrieve an advertisement, I use the following iptables rule:

iptables -A INPUT -i lo -p tcp -m tcp --dport 80 -j REJECT --reject-with tcp-reset

Some hosts that you might also want to add to your /etc/hosts because they are there to track users are:

127.0.0.1 www.google-analytics.com
127.0.0.1 auto.search.msn.com
127.0.0.1 ad.doubleclick.net
127.0.0.1 google-analytics.com
127.0.0.1 stat.livejournal.com
127.0.0.1 stats.surfaid.ihost.com
127.0.0.1 ads.imeem.com

They are not included by default in the lists above because it might break some websites if they were.

EDIT (2012-05-21)

I forgot to include port 443 for https in the iptables rule above. For example google uses https for googleadservices.com and others might too, so don't forget to also reset connections to port 443 by adding the rule given above a second time with --dport 443.
