announcing img2pdf

categories: code

tl;dr: lossless conversion of JPEG and JPEG2000 files to PDF without a significant increase in filesize: https://github.com/josch/img2pdf

Since she knew I was able to do all sorts of fancy stuff with PDF on my computer (inkscape, pdftk, pdflatex, cairo, ghostscript and others), a friend of mine asked me to convert a JPEG of a scanned document into a PDF for her. From a homework assignment that I solved as an undergrad I had a fairly good understanding of the PDF file format and I knew that PDF can embed JPEG data as-is for images. So I thought it would be easy enough to just wrap her JPEG in the PDF file structure and be done with it.

Surprisingly, it turned out that no tool I knew of (or was able to find on the internet) was able to do exactly that. Some tools converted my JPEG to a PDF of about equal size, but they re-encoded the JPEG and hence led to quality loss. Others did a lossless conversion, but achieved it by compressing an RGB representation of the image using zip/flate encoding (the right way to store images losslessly in PDF), which increased the filesize manyfold. I knew that it was technically possible to inject her JPEG into a PDF without changing a single byte of the JPEG, so I refused to either lose quality by re-encoding her JPEG or to blow up the 2.8MB JPEG into a 14MB PDF file. On top of that, imagemagick took a whopping 27 seconds to convert just a single JPEG to PDF (using lossless zip compression). This is completely unacceptable for bigger conversion tasks.

Searching the internet revealed other people having the same problem, and some claimed to be sitting on huge JPEG2000 scan archives they wanted to losslessly convert to PDF, so I decided to support JPEG2000 as well.
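To illustrate the idea of wrapping a JPEG in the PDF file structure, here is a minimal sketch in Python. It is not the actual img2pdf code: the function names `jpeg_size` and `jpeg_to_pdf` and the `dpi` parameter are made up for this example, and it assumes an 8-bit grayscale or RGB JPEG. The JPEG bytes go into an image object with the `/DCTDecode` filter exactly as they are.

```python
import struct


def jpeg_size(data):
    """Return (width, height, components) from the first SOF marker."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("lost marker sync")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD9:  # markers without length
            pos += 2
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        # SOF0..SOF15, except DHT (C4), JPG (C8) and DAC (CC)
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            _precision, height, width, ncomp = struct.unpack(
                ">BHHB", data[pos + 4:pos + 10])
            return width, height, ncomp
        pos += 2 + length
    raise ValueError("no SOF marker found")


def jpeg_to_pdf(data, dpi=72.0):
    """Wrap unmodified JPEG bytes in a single-page PDF (hypothetical helper)."""
    width, height, ncomp = jpeg_size(data)
    # Assumption: only grayscale and RGB; CMYK JPEGs need extra care.
    colorspace = {1: b"/DeviceGray", 3: b"/DeviceRGB"}[ncomp]
    pt_w = width * 72.0 / dpi   # page size in PDF points
    pt_h = height * 72.0 / dpi
    content = b"q %f 0 0 %f 0 0 cm /Im0 Do Q" % (pt_w, pt_h)
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %f %f] "
        b"/Resources << /XObject << /Im0 4 0 R >> >> /Contents 5 0 R >>"
        % (pt_w, pt_h),
        # The image object: the JPEG bytes are copied verbatim.
        b"<< /Type /XObject /Subtype /Image /Width %d /Height %d "
        b"/ColorSpace %s /BitsPerComponent 8 /Filter /DCTDecode "
        b"/Length %d >>\nstream\n" % (width, height, colorspace, len(data))
        + data + b"\nendstream",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.3\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object for the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1, xref_pos)
    return bytes(out)
```

Calling `jpeg_to_pdf(open("scan.jpg", "rb").read())` (with a hypothetical `scan.jpg`) returns the PDF as bytes. The result is only slightly larger than the input, because everything around the image is a few hundred bytes of PDF structure.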

Out came a tool that does exactly what I wanted. It takes image filenames as commandline arguments. If the image is a JPEG or JPEG2000 file, it will dump its content into the PDF structure as it is, without changing a single byte. If any other format is found, the normal zip/flate encoding will be used. Thus, in both cases, the encoding is lossless, while for JPEG and JPEG2000 images the result is also much smaller than it would have been by forcing re-encoding as zip/flate. Giving multiple images on the commandline will produce a multipage PDF.
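The decision described above can be sketched roughly like this. Again, this is an illustration and not the tool's real code: `pick_pdf_filter` and `encode_for_pdf` are hypothetical names, the format detection is a simple check of the leading magic bytes, and for other formats the raw pixel samples are assumed to have been decoded already by some image library.

```python
import zlib

# Leading bytes of a JP2 container and of a raw JPEG2000 codestream.
JP2_SIGNATURE = b"\x00\x00\x00\x0cjP  \r\n\x87\n"
J2K_CODESTREAM = b"\xff\x4f\xff\x51"


def pick_pdf_filter(data):
    """Choose the PDF stream filter from the file's magic bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "/DCTDecode"   # JPEG: embed as-is
    if data.startswith(JP2_SIGNATURE) or data.startswith(J2K_CODESTREAM):
        return "/JPXDecode"   # JPEG2000: embed as-is
    return "/FlateDecode"     # anything else: zip/flate the raw pixels


def encode_for_pdf(data, raw_pixels=None):
    """Return (filter, stream bytes) for the PDF image object."""
    pdf_filter = pick_pdf_filter(data)
    if pdf_filter == "/FlateDecode":
        if raw_pixels is None:
            raise ValueError("raw pixel samples needed for flate encoding")
        return pdf_filter, zlib.compress(raw_pixels)
    return pdf_filter, data   # not a single byte changed
```

Both branches are lossless. The difference is only in size: the JPEG and JPEG2000 branches keep the already compressed file, while the flate branch compresses uncompressed pixel samples.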

More information is available in the README.md.

git clone git://github.com/josch/img2pdf.git

Patches are welcome as usual.

Also tell me if I missed another application that can do the same.
