I recently discovered the timestamp counter instruction (rdtsc), which solved a problem I had: I needed to accurately benchmark a very small piece of code, but when I put it in a loop, gcc optimized it away with -O3.
static __inline__ unsigned long long getticks(void)
{
    unsigned a, d;
    /* rdtsc puts the low 32 bits of the counter in EAX and the high 32 bits in EDX */
    asm volatile("rdtsc" : "=a" (a), "=d" (d));
    return ((unsigned long long)a) | (((unsigned long long)d) << 32);
}
Code for other architectures can be found here.
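To show how the counter might be used for the kind of micro-benchmark described above, here is a minimal sketch. It assumes x86 and GCC or Clang. The names `work`, `min_ticks` and `seed` are made up for illustration, and `work` is just a stand-in for the code you actually want to time. The volatile read and write are meant to keep the compiler from dropping the call or hoisting it out of the timed region. Note that rdtsc itself is not a serializing instruction, so the CPU may still reorder a little around it, and very small counts should be read with some caution.

```c
#include <stdint.h>

/* Same counter read as getticks() above (x86 only, GCC/Clang inline asm). */
static inline uint64_t getticks(void)
{
    unsigned a, d;
    __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
    return ((uint64_t)a) | (((uint64_t)d) << 32);
}

/* Hypothetical code under test; replace with the real snippet. */
static unsigned work(unsigned x)
{
    return x * 2654435761u + 1u;
}

/* Input the compiler cannot see through, so work() is not constant-folded. */
static volatile unsigned seed = 12345u;

/* Returns the smallest tick count observed over `runs` timed calls.
   The result includes the overhead of reading the counter twice. */
static uint64_t min_ticks(unsigned runs)
{
    volatile unsigned sink = 0;   /* volatile store keeps the result live */
    uint64_t best = UINT64_MAX;

    for (unsigned i = 0; i < runs; i++) {
        uint64_t t0 = getticks();
        unsigned x = seed;        /* volatile read inside the timed region */
        sink = work(x + i);
        uint64_t t1 = getticks();

        if (t1 - t0 < best)
            best = t1 - t0;
    }
    (void)sink;
    return best;
}
```

Calling `min_ticks(1000)` and printing the result gives a rough lower bound in ticks. Taking the minimum rather than the average is one way to filter out runs that were disturbed by interrupts or cache misses.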
When using getticks(), one has to make sure that the code stays on the same processor, that the processor doesn't change its clock speed, and that the system is not hibernated or suspended in between.
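The first of these conditions can be enforced from within the program. The following is a sketch that assumes Linux with glibc, where `sched_getaffinity` and `sched_setaffinity` are available. The helper name `pin_to_first_allowed_cpu` is my own. It does nothing about clock speed changes or suspend, which have to be handled outside the program.

```c
#define _GNU_SOURCE   /* must come before any include; exposes the CPU_* macros on glibc */
#include <sched.h>

/* Pins the calling thread to the first CPU it is currently allowed to run on.
   Returns that CPU's index, or -1 on failure. */
static int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;

    /* A pid of 0 means "the calling thread". */
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof(one), &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;   /* no allowed CPU found */
}
```

Call it once before the first counter read. Picking a CPU from the current affinity mask instead of hard-coding CPU 0 avoids failures in containers or under taskset, where CPU 0 may not be available to the process.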