tldr; lossless conversion of JPEG and JPEG2000 files to PDF without significant increase of filesize: https://github.com/josch/img2pdf
Since she knew I was able to do all sorts of fancy stuff with pdf on my computer (inkscape, pdftk, pdflatex, cairo, ghostscript and others), a friend of mine asked me to convert a JPEG of a scanned document into a PDF for her. From a homework assignment that I solved as an undergrad I had a fairly good understanding of the PDF file format and I knew that PDF can embed JPEG data as-is for images. So I thought it would be easy enough to just wrap her JPEG with the PDF file structure and be done with it. Surprisingly it turned out that no tool I knew of (or was able to find on the internets) was able to do exactly that. Sure, some tools were able to convert my JPEG to a PDF of equal size, but they were re-encoding the JPEG and hence led to quality loss. Others did a lossless conversion but achieved it by compressing an RGB representation of the image using zip/flate encoding (the right way to store images losslessly in pdf), which increased the filesize manyfold. I knew that it was technically possible to inject her JPEG into a PDF without any byte of the JPEG changing, so I refused to either lose quality by re-encoding her JPEG or to blow up the 2.8MB JPEG into a 14MB PDF file. On top of that, imagemagick would take a whopping 27 seconds to convert just a single JPEG to PDF (using lossless zip compression). This is completely unacceptable for bigger conversion tasks. Searching on the internet revealed other people having the same problem, and some claimed to be sitting on huge JPEG2000 scan archives they wanted to losslessly convert to PDF, so I decided to support JPEG2000 as well.
Out came a tool that does exactly what I wanted. It takes image filenames as commandline arguments. If the image is a JPEG or JPEG2000 file, it will dump its content into the pdf structure as it is without changing a single byte. If any other format is found, the normal zip/flate encoding will be used. Thus, in both cases, the encoding is lossless, while for JPEG and JPEG2000 images the result is also much smaller than it would have been by forcing re-encoding as zip/flate. Giving multiple images on the commandline will produce a multipage pdf.
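To make the idea of "wrapping" a JPEG a bit more concrete, here is a minimal sketch in Python. It is not the actual img2pdf code: the function names jpeg_info and jpeg_to_pdf are made up for this illustration, it only handles a single JPEG on a single page, it assumes 8 bits per component, and it ignores details such as the inverted channels of some CMYK JPEGs. The JPEG bytes are copied into an image stream with the DCTDecode filter without being touched.

```python
import struct


def jpeg_info(data):
    """Return (width, height, components) read from the JPEG's SOF marker."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("corrupt JPEG marker")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD9:  # markers without payload
            pos += 2
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        # SOF0..SOF15, excluding DHT (C4), JPG (C8) and DAC (CC)
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            _precision, height, width, comps = struct.unpack(
                ">BHHB", data[pos + 4:pos + 10])
            return width, height, comps
        pos += 2 + length
    raise ValueError("no SOF marker found")


def jpeg_to_pdf(jpeg, dpi=72.0):
    """Wrap raw JPEG bytes into a one-page PDF without re-encoding them."""
    width, height, comps = jpeg_info(jpeg)
    colorspace = {1: b"/DeviceGray", 3: b"/DeviceRGB", 4: b"/DeviceCMYK"}[comps]
    page_w = width * 72.0 / dpi  # PDF units are 1/72 inch
    page_h = height * 72.0 / dpi
    # Content stream: scale the unit square to the page and paint the image.
    content = b"q %.4f 0 0 %.4f 0 0 cm /Im0 Do Q" % (page_w, page_h)
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %.4f %.4f] "
        b"/Resources << /XObject << /Im0 4 0 R >> >> /Contents 5 0 R >>"
        % (page_w, page_h),
        # The image object: the stream body is the unmodified JPEG file.
        b"<< /Type /XObject /Subtype /Image /Width %d /Height %d "
        b"/ColorSpace %s /BitsPerComponent 8 /Filter /DCTDecode /Length %d >>\n"
        b"stream\n" % (width, height, colorspace, len(jpeg))
        + jpeg + b"\nendstream",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.3\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object for the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_pos))
    return bytes(out)
```

Reading a file (say scan.jpg, a placeholder name) in binary mode, passing its bytes to jpeg_to_pdf and writing the result to scan.pdf should give a PDF that is only a few hundred bytes bigger than the JPEG.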
More information is available in the README.md
git clone git://github.com/josch/img2pdf.git
Patches are welcome as usual.
Also tell me if I missed another application that can do the same.
tldr; An RSS reader without Gnome/KDE dependencies using Python, Gtk, Webkit: https://github.com/josch/pyferea
About a year ago I wrote the following email to the debian-user list:
Subject: is there no sane, minimal, graphical RSS feed reader in existance?
Hi,
I've been looking for a good RSS feed reader for years now but I still seem not to be able to find a sane, minimal graphical RSS reader. What I'm using now is liferea which is okay but could be more minimal and mainly, is way too slow to enjoy using it (search for the fsync issue).
So what is left?
- There is a bunch of web based readers but I treasure having my stuff offline as well.
- There are readers for the gnome or KDE environment. Since I use neither it would mean to get 100s of MB (literally) of dependencies
- There are firefox extensions but why would I have to install a web browser to read my RSS feeds?
- There are thunderbird, evolution and opera but same argument: why would I want to install an email-client/browser for my feeds?
- There are clients like blam that are written in .NET/mono and would also require dozens of dependencies (not talking about .NET evilness)
- There are readers for the terminal but I have several feeds with images and I dont want to open another window of my browser each time.
I can't imagine there are no others who do not use Gnome/KDE (having a more minimal setup) but would want to have a graphical RSS reader?
What I'm looking for is not much: it would just depend on either gtk/qt/efl/whatever for its UI, would have one list of the feeds, another list for recent feed items and another frame with a gecko or webkit plugin for presenting the item.
Why this feature/dependency bloat everywhere? Why is there no simple reader with minimal dependencies? Am I just overlooking one? Are my requirements too weird? I'm not afraid to compile from source either, should it not be in Debian. Should I like it I would also package it for Debian.
As I said, liferea is close (just had to bear with the gconf2 dependency) but slow as hell (and no, I refuse to use the "fsync workaround"). Are there others that share my need?
If there is really no such thing as a real minimal graphical RSS reader, I'm close to writing one myself.
Since I'm not subscribed, please dont forget to CC me.
Thanks!
cheers, josch
I pasted this email because it best describes my issue and I would've just repeated its content anyways.
Well, apparently I was close enough to writing one myself that I did it. I hereby announce PyFeRea, a minimal RSS reader without Gnome or KDE dependencies, written in Python, using Gtk and WebKit and coming without the LiFeRea slowness.
I suck at naming things and hence I am still left with my initial 'codename' for it: PyFeRea, as I wanted something that looked just like LiFeRea but would not suffer from its slowness.
More can be found in the project's README.md
git clone git://github.com/josch/pyferea.git
Patches are welcome as usual.
Also tell me if I missed an RSS reader that fits my requirements.
This is the title of a great talk at the last Chaos Communication Congress in Berlin (27c3).
When writing my own pdf parser for a homework assignment that I put way too much ambition into, I encountered all of what is mentioned in that talk and was also able to realize how bad the situation really is. I just deleted three paragraphs of this post where I started to rant about how frickin bad the pdf format is. How unworkable it is and how literally impossible to perfectly implement. But instead, just watch the video of the talk and make sure to remember that it is even worse than Julia Wolf is able to make clear in the 1h she was given.
So after I stopped myself from spreading another wave of my pdf hate over the internets, let's look at the issue at hand:
This is the snippet I uncompressed from the pdf to (just by chance) find the number I was looking for. The 000000 piece actually contained the number I needed.
6 0 obj
/DA (/Arial 14 Tf 0 g)
/Rect [ 243.249 176.784 382.489 210.297 ]
/BG [ 0.75 0.75 0.75 ]
/P 4 0 R
/N 7 0 R
So let me say: WTF? My bank not only requires me to resort to one specific pdf implementation (namely the acrobat reader by adobe) but also requires me to first pay a US based company for an operating system that this reader software works on? Or am I really supposed to go through the raw pdf source by hand?? Bleh...
Also, don't ask for my code - it's super dirty and unreadable. Instead look at the mupdf project. It supplies a renderer which is massively superior to poppler in terms of speed (even suitable for embedded devices) and comes with a program called pdfclean which does the same thing my program did so that I was able to get the number I needed.
I recently discovered the timestamp counter instruction (rdtsc), which solved a problem for me: I had to accurately benchmark a very small piece of code, and putting it in a loop made gcc optimize it away with -O3.
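/* Read the x86 time stamp counter: rdtsc puts the low 32 bits in EAX, the high 32 bits in EDX. */
static __inline__ unsigned long long getticks(void)
{
    unsigned a, d;
    asm volatile("rdtsc" : "=a" (a), "=d" (d));
    return ((unsigned long long)a) | (((unsigned long long)d) << 32);
}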
More code for other architectures as well can be found here.
When using that piece of code one has to take care that the code stays on the same processor, the processor doesn't change its clock speed and the system is not hibernated/suspended in between.
Using terminal applications instead of GUI applications has the definite speed advantage of not having to spend time on moving a cursor to a 2D coordinate on the screen but instead just doing a 2mm downward motion with one or two fingers. I wonder if something other than terminal applications will allow me to interact with my computer faster until the invention of direct neural interfaces.
I really like terminal applications - not only because of the speed advantage but also because they offer much more real-estate in terms of usable screen space, as they don't waste space on stuff like buttons, menu bars or all those pixels wasted on separators and the space between UI elements. Starting to use the pentadactyl firefox extension was not only a huge browsing speed improvement for me but I could also finally use my whole frickin 1920x1080 pixels for viewing the website (well, except for a 19 pixel status bar at the bottom).
scriptreplay(1), being part of the bsdutils package in debian/ubuntu and the util-linux package in rpm based distributions, will probably already be installed on your system and is one of those really handy tools you did not know of before even though they were always there. script is a program that you can use to capture a terminal session or console application output, whereas scriptreplay is able to replay that session by using a timing file that script is able to output on stderr. Even without the timing file, the typescript will include all terminal interaction and is readily readable with a text editor, printable or easily uploadable to a pastebin (no more selecting parts of your terminal window with your mouse and copy-pasting that into the browser). Without the timing file it is useful to document a process - for example if you want to show others how a bug happened on your system or for a homework submission you have to hand in.
A very powerful feature is the mentioned timing file, which you can capture by redirecting the stderr output of script into a file. With the use of scriptreplay you can then watch your terminal interaction in real time.
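The replay mechanism itself is simple enough to sketch in a few lines of Python. This is only an illustration and not how scriptreplay is actually implemented: the function names parse_timing and replay are made up, and it assumes the classic timing format where every line holds two fields, the delay in seconds since the previous output and the number of bytes that were written. Real recordings also start with a "Script started on ..." line in the typescript, which this sketch leaves to the caller to strip.

```python
import sys
import time


def parse_timing(text):
    """Parse timing lines of the form '<delay in seconds> <byte count>'."""
    steps = []
    for line in text.splitlines():
        if not line.strip():
            continue
        delay, count = line.split()
        steps.append((float(delay), int(count)))
    return steps


def replay(typescript, steps, out=None, speed=1.0, sleep=time.sleep):
    """Write the typescript bytes chunk by chunk, waiting between chunks.

    typescript is a bytes object, out a binary stream (defaults to stdout).
    Returns the number of bytes that were replayed.
    """
    if out is None:
        out = sys.stdout.buffer
    pos = 0
    for delay, count in steps:
        sleep(delay / speed)  # speed > 1 plays the session faster
        out.write(typescript[pos:pos + count])
        out.flush()
        pos += count
    return pos
```

Calling replay with the contents of the typescript (read in binary mode) and the parsed timing file prints the session at its original pace. Passing speed=2.0 halves all delays.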
While it would be kinda tedious to share your typescript and timing file over a pastebin so that another party would have to download those manually and use scriptreplay to watch them, I imagined a kind of youtube for terminal sessions. This would also solve the problem of all those youtube videos that are screen captures of terminal windows - needlessly encoding text as moving images and thereby not only destroying the ability to copy&paste things but also needlessly increasing the filesize.
I remember having seen such a website years ago but can't manage to find it again. To show a proof of concept I prepared the following website:
I will probably never have the time to make a real webservice out of it but maybe something of this is useful for others.