network file transfer to marvell kirkwood

categories: code

I have a Seagate GoFlex Net with two 2TB hard drives attached to it via SATA. The device itself is connected to my PC via its Gigabit Ethernet connection. It houses a Marvell Kirkwood at 1.2GHz with 128MB of RAM. I am booting Debian from a USB stick connected to its USB 2.0 port.

The specs are pretty neat so I planned to use it as my NAS with 4TB of storage attached to it. The most common use case is the transfer of big files (1-10 GB) between my laptop and the device.

Now what are the common ways to achieve this?

scp:

scp /local/path user@goflex:/remote/path

rsync:

rsync -Ph /local/path user@goflex:/remote/path

sshfs:

sshfs user@goflex:/remote/path /mnt
cp /local/path /mnt

ssh:

ssh user@goflex "cat > /remote/path" < /local/path

I then did some benchmarks to see how they perform:

scp: 5.90 MB/s

rsync: 5.16 MB/s

sshfs: 5.05 MB/s

ssh: 5.42 MB/s

Since they all use ssh for transmission, the similarity of the results does not come as a surprise, and 5.90 MB/s is not too shabby for a plain scp. It means that I can transfer 1 GB in a bit under three minutes. I could live with that. Even for 10 GB files I would only have to wait for half an hour, which is mostly okay since I usually know well in advance that a file is needed.

But let's see if we can somehow get faster than this. Let's analyze where the bottleneck is.

Let's have a look at the effective TCP transfer rate with netcat:

ssh user@goflex "netcat -l -p 8000 > /dev/null"
dd if=/dev/zero bs=10M count=1000 | netcat goflex 8000

79.3 MB/s, wow! Can we get more? Let's try increasing the buffer size on both ends. This can be done using nc6 with the -x argument on both sides.

ssh user@goflex "netcat -x -l -p 8000 > /dev/null"
dd if=/dev/zero bs=10M count=1000 | netcat -x goflex 8000

103 MB/s okay this is definitely NOT the bottleneck here.
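
The netcat measurement above boils down to pushing zeros through a TCP socket and dividing bytes by seconds. Here is a minimal Python sketch of the same idea. It is only an illustration: it assumes a loopback connection on an ephemeral port instead of the real link to the GoFlex, and the function name measure_tcp_throughput and the default sizes are made up.

```python
import socket
import threading
import time


def measure_tcp_throughput(total_bytes=64 * 1024 * 1024, chunk=1024 * 1024):
    """Send total_bytes of zeros over a loopback TCP connection.

    Returns (bytes_received, rate_in_MB_per_s). The receiver discards
    the data, like "netcat -l -p 8000 > /dev/null" does.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = [0]

    def sink():
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(chunk)
                if not data:  # sender closed the connection
                    break
                received[0] += len(data)

    thread = threading.Thread(target=sink)
    thread.start()

    payload = bytes(chunk)  # a chunk of zeros, like /dev/zero
    start = time.monotonic()
    with socket.create_connection(("127.0.0.1", port)) as client:
        sent = 0
        while sent < total_bytes:
            n = min(chunk, total_bytes - sent)
            client.sendall(payload[:n])
            sent += n
    thread.join()  # wait until the receiver has seen everything
    elapsed = time.monotonic() - start
    server.close()
    return received[0], received[0] / elapsed / 1e6


if __name__ == "__main__":
    count, rate = measure_tcp_throughput()
    print("%d bytes at %.1f MB/s" % (count, rate))
```

On loopback this mostly measures memory and CPU speed. To test a real link, the receiving half would have to run on the remote machine.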

Let's see how fast I can read from my hard drive:

hdparm -tT /dev/sda

114.86 MB/s.. hmm... and writing to it?

ssh user@goflex "time sh -c 'dd if=/dev/zero of=/remote/path bs=10M count=100; sync'"

42.93 MB/s

Those values are far above the puny 5.90 MB/s I get with scp. A look at the CPU usage during transfer shows that the ssh process is at 100% CPU usage the whole time. So the bottleneck is ssh and the encryption/decryption involved.

I'm transferring directly from my laptop to the device. Not even a switch is in the middle, so encryption seems to be quite pointless here. Even authentication doesn't seem to be necessary in this setup. So how to make the transfer unencrypted?

The ssh protocol specifies a null cipher for unencrypted connections. OpenSSH doesn't support this. Supposedly, adding

{ "none", SSH_CIPHER_NONE, 8, 0, 0, EVP_enc_null }

to cipher.c adds a null cipher but I didn't want to patch around in my installation.

So let's see how a plain netcat performs.

ssh user@goflex "netcat -l -p 8000 > /remote/path"
netcat goflex 8000 < /local/path

32.9 MB/s. This is far better! Let's try a bigger buffer:

ssh user@goflex "netcat -x -l -p 8000 > /remote/path"
netcat -x goflex 8000 < /local/path

37.8 MB/s, now this is even better! My gigabyte will now take under half a minute and my 10 GB file under five minutes.

But it is tedious to copy multiple files or even a whole directory structure with netcat. There are far better tools for this.

An obvious candidate that doesn't encrypt is rsync when used with the rsync protocol.

rsync -Ph /local/path user@goflex::module/remote/path

30.96 MB/s which is already much better!

I used the following line in inetd.conf to have inetd start the rsync daemon:

rsync stream tcp nowait root /usr/bin/rsync rsyncd --daemon

But it is slower than pure netcat.

If we want directory trees, then how about netcatting a tarball?

ssh user@goflex "netcat -x -l -p 8000 | tar -C /remote/path -x"
tar -c /local/path | netcat goflex 8000

26.2 MB/s so tar seems to add quite the overhead.

How about ftp then? For this test I installed vsftpd and achieved a speed of 30.13 MB/s. This compares well with rsync.

I also tried out nfs. Not surprisingly, its transfer rate is on par with rsync and ftp at 31.5 MB/s.

So what did I learn? Let's make a table:

method              speed in MB/s
scp                 5.90
rsync+ssh           5.16
sshfs               5.05
ssh                 5.42
netcat              32.9
netcat -x           37.8
netcat -x | tar     26.2
rsync               30.96
ftp                 30.13
nfs                 31.5
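
To make the numbers above more tangible, here is a small Python sketch that turns the measured rates into transfer times. It assumes 1 GB = 1000 MB. The helper names transfer_seconds and as_minutes are made up for illustration.

```python
# Measured rates in MB/s, copied from the benchmarks above.
RATES_MB_S = {
    "scp": 5.90,
    "rsync+ssh": 5.16,
    "sshfs": 5.05,
    "ssh": 5.42,
    "netcat": 32.9,
    "netcat -x": 37.8,
    "netcat -x | tar": 26.2,
    "rsync": 30.96,
    "ftp": 30.13,
    "nfs": 31.5,
}


def transfer_seconds(size_gb, rate_mb_s):
    """Seconds needed for size_gb gigabytes, assuming 1 GB = 1000 MB."""
    return size_gb * 1000.0 / rate_mb_s


def as_minutes(seconds):
    """Format a number of seconds as m:ss."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return "%d:%02d" % (minutes, secs)


if __name__ == "__main__":
    for method, rate in RATES_MB_S.items():
        print("%-16s 1 GB in %6s, 10 GB in %6s" % (
            method,
            as_minutes(transfer_seconds(1, rate)),
            as_minutes(transfer_seconds(10, rate)),
        ))
```

With these figures, scp needs just under three minutes for 1 GB while netcat -x stays under half a minute.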

For transfer of a directory structure or many small files, unencrypted rsync seems the way to go. It outperforms a copy over ssh more than five-fold.

When the convenience of having the remote data mounted locally is needed, nfs outperforms sshfs at speeds similar to rsync and ftp.

As rsync and nfs already provide good performance, I didn't look into a more convenient solution using ftp.

My policy will now be to use rsync for partial file transfers and mount my remote files with nfs.

For transfer of one huge file, netcat is faster. With increased buffer sizes it is about 15% faster than without (37.8 vs. 32.9 MB/s) and a bit more than a fifth faster than rsync.

But copying a file with netcat is tedious and hence I wrote a script that simplifies the whole remote-login, listen, send process to one command. First argument is the local file, second argument is the remote name and path just as in scp.

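#!/bin/bash -e
# usage: <script> /local/file user@host:/remote/dir
# bash instead of plain sh because printf %q is not POSIX

# split user@host:path into its parts
HOST=${2%%:*}
USER=${HOST%%@*}
if [ "$HOST" = "$2" ] || [ "$USER" = "$HOST" ]; then
        echo "second argument is not of form user@host:path" >&2
        exit 1
fi
HOST=${HOST#*@}
LPATH=$1
LNAME=$(basename "$1")
# quote the remote path so that it survives the remote shell
RPATH=$(printf %q "${2#*:}/$LNAME")

# start the listener on the remote side and give it time to come up
ssh "$USER@$HOST" "nc6 -x -l -p 8000 > $RPATH" &
sleep 1.5
pv "$LPATH" | nc6 -x "$HOST" 8000

wait $!

# compare the checksums of both copies
ssh "$USER@$HOST" "md5sum $RPATH" &
md5sum "$LPATH"

wait $!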

I use pv to get a status of the transfer on my local machine and ssh to login to the remote machine and start netcat in listening mode. After the transfer I check the md5sum to be sure that everything went fine. This step can of course be left out but during testing it was useful. Escaping of the arguments is done with printf %q.
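
The argument handling of the script may be easier to follow outside of shell parameter expansion. The following Python sketch mirrors the same steps: split user@host:path at the first colon and the first @, then quote the remote path for the remote shell. The function names parse_target and remote_path are made up for illustration, and shlex.quote stands in for printf %q (it quotes differently but serves the same purpose).

```python
import posixpath
import shlex


def parse_target(target):
    """Split "user@host:path" into (user, host, path).

    Raises ValueError if the colon or the @ is missing, like the
    check at the top of the shell script.
    """
    userhost, colon, path = target.partition(":")
    user, at, host = userhost.partition("@")
    if not colon or not at:
        raise ValueError("argument is not of form user@host:path")
    return user, host, path


def remote_path(target, local_path):
    """Return the shell-quoted remote file name for a local file."""
    _, _, directory = parse_target(target)
    name = posixpath.basename(local_path)
    return shlex.quote(directory + "/" + name)


if __name__ == "__main__":
    print(parse_target("user@goflex:/remote/path"))
    print(remote_path("user@goflex:/remote/path", "/local/my file.iso"))
```

The quoted result is what would be pasted into the command line that ssh runs on the remote side.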

The above has several problems. The sleep is unclean, but it cannot be avoided because the remote needs some time to start netcat and listen. Another problem is that one has to specify a username. Another is that scp requires its argument to be double-escaped while the script above does not, so the two behave differently. The host that it netcats to is the same as the host it ssh's to, which is not necessarily correct as one can specify an alias in ~/.ssh/config. Last but not least, this only transfers from the local machine to the remote host. Doing it the other way round is of course possible in the same manner, but then one must be able to tell how the local machine is reachable from the remote host.

Due to all those inconveniences I decided not to expand on the above script.

Plus, rsync and nfs seem to perform well enough for day to day use.


a periodic counter

categories: code

tldr: counting without cumulative timing errors

Sometimes I just want a small counter, incrementing an integer each second, running somewhere in a terminal. Maybe it is because my wristwatch is in the bathroom or because I want to do more rewarding things than counting seconds manually. Maybe I want to know not only how long something takes but also for how long it has already been running in the middle of its execution. There are many reasons why I would want some script that does nothing else than simply counting upward or downward with some specific frequency.

Some bonuses:

  • the period can be given as a floating point number; periods of a fraction of a second would be especially nice
  • it should be able to execute an arbitrary program after each period
  • it should not matter how long the execution of this program takes for the overall counting

Now this cannot be hard, right? One would probably write this line and be done with it:

i=0; while sleep 1; do echo $i; i=$((i+1)); done

or to count for a certain number of steps:

for i in `seq 1 100`; do echo $i; sleep 1; done

This would roughly do the job but in each iteration some small offset would be added and though small, this offset would quickly accumulate.

Sure, that cumulative error is tiny, but given that this task seems to be so damn trivial, I couldn't bear running any of the above anymore and started looking for a solution.

Sure, I could just quickly hack a small C program that would check gettimeofday(2) at each iteration and adjust the time passed to usleep(3) accordingly, but there HAD to be people before me with the same problem who already came up with a solution.

And there was! The solution is the sleepenh(1) program which, when given the timestamp of its last invocation and the sleep time in floating point seconds, will sleep for just the right amount to keep the overall frequency stable.

The author suggests that sleepenh be used in shell scripts that need to repeat an action in a regular time interval, and that is just what I did.
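
The underlying idea can be sketched in a few lines of Python. This is not how sleepenh or the periodic script is implemented, just an illustration of the technique: every wake-up time is computed from the start time instead of from the previous wake-up, so the time spent in each iteration does not accumulate. The function name run_periodic and its clock and sleep parameters are made up for illustration.

```python
import time


def run_periodic(period, count, action,
                 clock=time.monotonic, sleep=time.sleep):
    """Call action(i) count times, once per period seconds.

    Returns the total elapsed time as seen by clock. The clock and
    sleep functions can be swapped out, e.g. for testing.
    """
    start = clock()
    for i in range(count):
        action(i)
        # Absolute deadline of the next tick, measured from the start
        # and not from the end of this iteration.
        deadline = start + (i + 1) * period
        remaining = deadline - clock()
        if remaining > 0:
            sleep(remaining)
    return clock() - start


if __name__ == "__main__":
    # Three ticks, ten per second, printing the counter each time.
    elapsed = run_periodic(0.1, 3, print)
    print("elapsed: %.3f s" % elapsed)
```

If an action takes longer than one period, the loop skips the sleep and catches up on the following ticks.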

The result is trivial and simple but does just what I want:

  • the interval will stay the same on average and the counter will not "fall behind"
  • count upward or downward
  • specify interval length as a floating point number of seconds including fractions of one second
  • begin to count at given integer and count for a specific number of times or until infinity
  • execute a program at every step, optionally by forking it from the script for programs possibly running longer than the given interval

You can check it out and read how to use and what to do with it on github:

https://github.com/josch/periodic

Now let's compare the periodic script with the second example from above:

$ time sh -c 'for i in `seq 1 1000`; do echo $i; sleep 1; done'
0.08s user 0.12s system 0% cpu 16:41.55 total

So after only 1000 iterations, the counter is already off by 1.55 seconds. This means that instead of a period of 1.0 s, the actual period was 1.00155 s, which is a frequency of about 0.99845 Hz instead of 1.0 Hz. Is it too much to not want this 0.155% of error?

$ time ./periodic -c 1000
0.32s user 0.00s system 0% cpu 16:40.00 total

1000 iterations took exactly 1000 seconds. Cool.


announcing img2pdf

categories: code

tldr; lossless conversion of JPEG and JPEG2000 files to PDF without significant increase of filesize: https://github.com/josch/img2pdf

Since she knew I was able to do all sorts of fancy stuff with pdf on my computer (inkscape, pdftk, pdflatex, cairo, ghostscript and others), a friend of mine asked me to convert a JPEG of a scanned document into a PDF for her. From a homework assignment that I solved as an undergrad I had a fairly good understanding of the PDF file format and I knew that PDF just uses embedded JPEG data for images. So I thought it would be easy enough to just wrap her JPEG with the PDF file structure and be done with it.

Surprisingly it turned out that no tool I knew of (or was able to find on the internets) was able to do exactly that. Some tools were able to convert my JPEG to a PDF of equal size but they were re-encoding the JPEG and hence led to quality loss. Others did a lossless conversion but achieved it by compressing an RGB representation of the image using zip/flate encoding (the right way to store images losslessly in pdf), which increased the filesize manyfold. I knew that it was technically possible to inject her JPEG into a PDF without any byte of the JPEG changing, so I refused to either lose quality by re-encoding her JPEG or to blow up the 2.8MB JPEG to a 14MB PDF file. On top of that, imagemagick would take a whopping 27 seconds to convert just a single JPEG to PDF (using lossless zip compression). This is completely unacceptable for bigger conversion tasks.

Searching on the internet revealed other people having the same problems, and some claimed to sit on huge JPEG2000 scan archives they wanted to losslessly convert to PDF, so I decided to support JPEG2000 as well.

Out came a tool that does exactly what I wanted. It takes image filenames as commandline arguments. If the image is a JPEG or JPEG2000 file, it will dump its content into the pdf structure as it is, without changing a single byte. If any other format is found, the normal zip/flate encoding will be used. Thus, in both cases the encoding is lossless, while JPEG and JPEG2000 images also end up much smaller than they would have been by forcing re-encoding as zip/flate. Giving multiple images on the commandline will produce a multipage pdf.
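
To illustrate what "wrapping a JPEG with the PDF file structure" means, here is a minimal Python sketch. It is not the img2pdf code, only a stripped-down illustration under several assumptions: one image per document, 8 bits per component, a page size derived from a hypothetical dpi parameter, and no special handling of CMYK JPEGs. The function names jpeg_size and jpeg_to_pdf are made up. The JPEG bytes are stored unchanged as an image stream with the DCTDecode filter.

```python
import struct


def jpeg_size(data):
    """Return (width, height, components) from the first SOF marker."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("invalid JPEG marker")
        marker = data[pos + 1]
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        # SOF0..SOF15, excluding DHT (C4), JPG (C8) and DAC (CC)
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            _, height, width, comps = struct.unpack(
                ">BHHB", data[pos + 4:pos + 10])
            return width, height, comps
        pos += 2 + length
    raise ValueError("no SOF marker found")


def jpeg_to_pdf(data, dpi=72.0):
    """Wrap JPEG bytes in a one-page PDF without touching them."""
    width, height, comps = jpeg_size(data)
    colorspace = {1: b"/DeviceGray", 3: b"/DeviceRGB",
                  4: b"/DeviceCMYK"}[comps]
    w_pt, h_pt = width * 72.0 / dpi, height * 72.0 / dpi
    # Content stream: scale the unit square to the page, draw the image.
    content = b"q %.2f 0 0 %.2f 0 0 cm /Im0 Do Q" % (w_pt, h_pt)
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %.2f %.2f] "
        b"/Resources << /XObject << /Im0 4 0 R >> >> "
        b"/Contents 5 0 R >>" % (w_pt, h_pt),
        b"<< /Type /XObject /Subtype /Image /Width %d /Height %d "
        b"/ColorSpace %s /BitsPerComponent 8 /Filter /DCTDecode "
        b"/Length %d >>\nstream\n" % (width, height, colorspace, len(data))
        + data + b"\nendstream",
        b"<< /Length %d >>\nstream\n" % len(content)
        + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.3\n")
    offsets = []
    for number, body in enumerate(objects, 1):
        offsets.append(len(out))  # byte offset for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_pos))
    return bytes(out)
```

Since the image data is copied verbatim, the resulting PDF is only a few hundred bytes larger than the JPEG itself.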

More information is available in the README.md

git clone git://github.com/josch/img2pdf.git

Patches are welcome as usual.

Also tell me if I missed another application that can do the same.

announcing PyFeRea

categories: code

tldr; An RSS reader without Gnome/KDE dependencies using Python, Gtk, Webkit: https://github.com/josch/pyferea

About a year ago I wrote the following email to the debian-user list:

Subject: is there no sane, minimal, graphical RSS feed reader in existance?

Hi,

I've been looking for a good RSS feed reader for years now but I still seem not
to be able to find a sane, minimal graphical RSS reader.

What I'm using now is liferea which is okay but could be more minimal and
mainly, is way too slow to enjoy using it (search for the fsync issue).

So what is left?

- There is a bunch of web based readers but I treasure having my stuff offline
  as well.

- There are readers for the gnome or KDE environment. Since I use neither it
  would mean to get 100s of MB (literally) of dependencies

- There are firefox extensions but why would I have to install a web browser to
  read my RSS feeds?

- There are thunderbird, evolution and opera but same argument: why would I
  want to install an email-client/browser for my feeds?

- There are clients like blam that are written in .NET/mono and would also
  require dozens of dependencies (not talking about .NET evilness)

- There are readers for the terminal but I have several feeds with images and I
  dont want to open another window of my browser each time.

I can't imagine there are no others who do not use Gnome/KDE (having a more
minimal setup) but would want to have a graphical RSS reader?

What I'm looking for is not much: it would just depend on either
gtk/qt/efl/whatever for its UI, would have one list of the feeds, another list
for recent feed items and another frame with a gecko or webkit plugin for
presenting the item. Why this feature/dependency bloat everywhere?

Why is there no simple reader with minimal dependencies? Am I just overlooking
one? Are my requirements too weird? I'm not afraid to compile from source
either, should it not be in Debian. Should I like it I would also package it
for Debian.

As I said, liferea is close (just had to bear with the gconf2 dependency) but
slow as hell (and no, I refuse to use the "fsync workaround").

Are there others that share my need? If there is really no such thing as a real
minimal graphical RSS reader, I'm close to writing one myself.

Since I'm not subscribed, please dont forget to CC me. Thanks!

cheers, josch

I pasted this email because it best describes my issue and I would've just repeated its content anyways.

Well, apparently I was close enough to writing one myself that I did it. I am hereby announcing PyFeRea, a minimal RSS reader without Gnome or KDE dependencies, written in Python, using Gtk and WebKit and coming without the LiFeRea slowness.

I suck at naming things and hence I am still left with my initial 'codename' for it: PyFeRea as I wanted something that looked just as LiFeRea but would not suffer from its slowness.

More can be found in the project's README.md

git clone git://github.com/josch/pyferea.git

Patches are welcome as usual.

Also tell me if I missed an RSS reader that fits my requirements.

transposing csv for gnuplot

categories: oneliner

I recently got a csv that was exported from openoffice spreadsheet with data arranged in rows and not columns as gnuplot likes it. It seems that gnuplot (intentionally) lacks the ability to parse data in rows instead of columns. Hence I had to switch rows and columns (transpose) my input csv such that gnuplot likes it.

Transposing whitespace-delimited text can be done with awk but csv is a bit more complex as it allows quotes and escapes. So a solution had to be found which understood how to read csv.

This turned out to be so simple and minimalistic that I had to post the resulting oneliner that did the job for me:

python -c'import csv,sys;csv.writer(sys.stdout,csv.excel_tab).writerows(map(None,*list(csv.reader(sys.stdin))))'

This will read input csv from stdin and output the transpose to stdout. Note that it needs Python 2, as the map(None, ...) idiom does not work in Python 3. The transpose is done by using:

map(None, *thelist)

Another way to do the transpose in python is by using:

zip(*thelist)

But this solution doesn't handle rows of different length well, as zip cuts everything off at the length of the shortest row.

In addition, by using the excel_tab dialect in the csv.writer, the solution above will output the csv tab-delimited instead of using commas, as gnuplot likes it.

The solution above is problematic when some of the input values in between are empty. The problem is not that the csv would be transposed incorrectly but that gnuplot collapses several whitespace characters into one. There are several solutions to that problem. Either insert "-" in the output instead of an empty cell:

python -c'import csv,sys; csv.writer(sys.stdout, csv.excel_tab).writerows(map(lambda *x:map(lambda x:x or "-",x),*list(csv.reader(sys.stdin))))'

Or output a comma-delimited csv and tell gnuplot that the input is comma-delimited:

python -c'import csv,sys;csv.writer(sys.stdout).writerows(map(None,*list(csv.reader(sys.stdin))))'

And then in gnuplot:

set datafile separator ","
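
The one-liners above rely on map(None, ...), which only exists in Python 2. For Python 3, a rough equivalent can be sketched with itertools.zip_longest, which also pads rows of different length. The function name transpose_csv and its parameters are made up for illustration.

```python
import csv
import io
from itertools import zip_longest


def transpose_csv(text, delimiter="\t", placeholder=""):
    """Transpose csv text and return it with the given delimiter.

    Short rows are padded, and every empty or missing cell is
    replaced by placeholder (e.g. "-" to keep gnuplot from
    collapsing neighbouring tabs).
    """
    rows = list(csv.reader(io.StringIO(text)))
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for column in zip_longest(*rows, fillvalue=""):
        writer.writerow([cell or placeholder for cell in column])
    return out.getvalue()


if __name__ == "__main__":
    import sys
    sys.stdout.write(transpose_csv(sys.stdin.read(), placeholder="-"))
```

Saved to a file, this reads csv from stdin and writes the tab-delimited transpose to stdout, just like the one-liners.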