A recent email thread among Oaktable Network members put me in the mood for another post in the Little Things Doth Crabby Make series. The topic at hand was this post on the MySQL Performance Blog about timing queries. In short, the post was about timing MySQL queries using clock_gettime(). In that post, the blogger showed the following pseudo code to help describe the problem he was seeing:
start_time = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tp);
... query_execution ...
end_time = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tp);
total_time = end_time - start_time;
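For orientation, the real clock_gettime() interface fills in a struct timespec and returns 0 or -1 rather than returning the time, so a literal C rendering of that pseudo code, my own sketch and not the blogger's actual code, looks roughly like this (older glibc needs -lrt at link time):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;

    /* clock_gettime() fills in the timespec; it does not return the time. */
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);

    /* ... the work being timed (the query, in the blogger's case) ... */

    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);

    long long elapsed_ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                         + (end.tv_nsec - start.tv_nsec);
    printf("elapsed: %lld ns\n", elapsed_ns);
    return 0;
}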
According to the post, the time rendered was 18446744073709550.000 which, of course, on a 2.6 GHz processor would be 82 days. What the blogger likely didn't know is that when called with this argument the clock_gettime() routine uses the CPU time stamp counter (rdtsc). As soon as I saw 18.4 quadrillion (or should I say billiard) I knew this was a clock wrap issue. But, to be honest, I had to look at the manpage to see what CLOCK_THREAD_CPUTIME_ID actually does. It turns out that for threaded (pthread) programs this call uses the processor time stamp counter. The idea of wrapping rdtsc in a function call seems bizarre to me, but to each their own.
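For what it's worth, a figure of that magnitude is exactly what unsigned 64-bit arithmetic produces when end_time reads slightly "earlier" than start_time, as it can when the two readings come from different CPUs' counters. A minimal illustration with made-up numbers (they are not the blogger's values):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Hypothetical readings: end is a hair "earlier" than start. */
    uint64_t start = 2000;
    uint64_t end   = 334;

    uint64_t diff = end - start;   /* unsigned arithmetic wraps past zero */

    /* Prints 18446744073709549950, i.e., roughly 2^64. Divide a wrapped
     * nanosecond count by 1000 and format it as a double and you land
     * right around the figure quoted in the post. */
    printf("%llu\n", (unsigned long long)diff);
    return 0;
}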
Comparing the time stamp counter read on one x86 CPU against one read on another CPU will produce bizarre arithmetic results. Well, of course it will, since the time stamp counters are local to each CPU (not synchronized across CPUs). I know a bit about this topic since I started using rdtsc() to time tight code back in the Pentium Pro days (circa 1996). And, yes, you have to lock down (hard processor affinity) the process using rdtsc() to one CPU. But that isn't all. Actually, the most accurate high-resolution timing goes more like this (a rough C sketch follows the list):
- Hard Affinity me to CPU N
- Disable process preemption (only good operating systems support this)
- Serialize CPU with CPUID
- rdtsc()
- do whatever it is you are timing (better not be any blocking code or device drivers involved)
- rdtsc()
- Re-Enable process preemption
- Release from Hard Affinity (if desired)
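For the curious, here is a minimal user-space sketch of those steps on Linux/x86 with GCC inline assembly. Preemption cannot be disabled from user space, so that step only appears as a comment, and the CPU number, like everything else here, is an assumption for illustration:

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>

/* Read the processor time stamp counter. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Serialize the CPU: CPUID does not execute until all previously
 * issued instructions have completed. */
static inline void serialize(void)
{
    uint32_t eax = 0, ebx, ecx, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx));
}

int main(void)
{
    /* Hard affinity to CPU 0 so both readings come from the same
     * (unsynchronized) counter. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* Disabling preemption is a kernel-side privilege; the closest
     * user-space approximations are real-time priority or simply
     * accepting the risk of being time-sliced off. */

    serialize();                 /* CPUID before the first read */
    uint64_t t0 = rdtsc();

    /* ... whatever is being timed goes here ... */

    uint64_t t1 = rdtsc();

    printf("elapsed cycles: %llu\n", (unsigned long long)(t1 - t0));
    return 0;
}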
But all that is just trivial pursuit because I don't think anyone should time a MySQL query (or any SQL query for that matter) with nanosecond resolution anyway. And, after all, that is not what I'm blogging about. This is supposed to be a Little Things Doth Crabby Make post. So what am I blogging about?
Some Linux Manpages Make Me Crabby
The latest Linux manpage to make me crabby is indeed the manpage for clock_gettime(2). I like how it insinuates a requirement for hard processor affinity, but take a look at the following paragraph from the manpage:
The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.
Words Matter
Using the term migrated in this context is totally wrong, especially for NUMA-minded people. And, if you can't tell from my blogging of late, I assert that we all are, or should be, and definitely will be, NUMA-minded folks now that Intel has entered the Commodity NUMA Implementation market with the insanely cool Xeon 5500 “Nehalem” processor and QPI.
The only time the term migrate can be used in the context of process scheduling is when a NUMA system is involved. The clock_gettime(2) manpage was merely referring to a process being scheduled on different CPUs during its life. Generically speaking, there is no migration involved in that. It is a simple context switch. Come to think of it, context switch is a term that routinely falls prey to this sort of misuse. Too often I see the term context switch used to refer to a process entering the kernel. That is not a context switch. A context switch is a scheduling term specifically meaning the stopping of a process, the saving of its state and the switching to the next selected runnable process. Now, having said that, the next time a stopped process (either voluntarily blocked or time-sliced off) is scheduled, it could very well be on a different CPU. But that is not a migration.
Enter NUMA
A process migration is a NUMA-specific term related to the “re-homing” of a process’ memory from one NUMA “node” to another. Consider a process that is exec()ed on “node 0” of a NUMA system. The pages of its text, stack, heap, page tables and all other associated kernel structures will reside in node 0 memory. What if a system imbalance occurs such that the CPUs of node 1 are generally idle whereas the CPUs of node 0 are generally saturated? Well, the scheduler can simply run the processes homed on node 0 on node 1 processors. That is called remote execution, and one very important side effect of remote execution is that any memory resources required while doing so have to be yanked from the remote memory and installed in the local cache. Historical NUMA systems (e.g., the pioneering, proprietary NUMA implementations) had specialized NUMA caches on each node to house memory lines being used during remote execution. The Sequent NUMA-Q 2000, for instance, offered a 512 MB “remote cache.” In aggregate, that was 8 GB of remote cache on a system that supported a maximum of 64 GB RAM! CNI systems do not have specialized NUMA caches but instead a simple L3 cache that is generally quite small (e.g., 8 MB). I admit I have not done many tests to analyze remote execution versus migration on Xeon 5500 based systems. In general (as I point out in this post), extreme low latency and huge interconnect bandwidth (a la Xeon 5500) can mitigate a potentially undersized cache for remote lines, but the proof is only in the pudding of actual measurements. More on that soon, I hope.
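For anyone who wants to see the distinction in code, here is a minimal libnuma sketch of remote execution (the two-node layout, the node numbers and the buffer size are assumptions for illustration; link with -lnuma). The memory stays homed on node 0 while node 1's CPUs do the work; nothing here migrates the pages:

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() == -1) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    size_t sz = 64UL * 1024 * 1024;

    /* Home 64 MB of memory on node 0. */
    char *buf = numa_alloc_onnode(sz, 0);
    if (buf == NULL)
        return 1;

    /* Restrict execution to node 1's CPUs. Every touch of buf from here
     * on is remote execution: the lines have to come across the
     * interconnect (QPI on a Xeon 5500 box) into node 1's caches. */
    numa_run_on_node(1);
    memset(buf, 0, sz);

    numa_free(buf, sz);
    return 0;
}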
What Was It That Made Him Crabby?
The use of the NUMA-sanctified term migrate in the clock_gettime(2) manpage! Seems too picky, doesn’t it? OK, since I’m discussing NUMA and trying to justify an installment in the Little Things Doth Crabby Make series, how about this from the numactl(8) manpage:
EXAMPLES
numactl --interleave=all bigdatabase arguments
Run big database with its memory interleaved on all CPUs.
numactl --cpubind=0 --membind=0,1 process
Run process on node 0 with memory allocated on node 0 and 1.
numactl --preferred=1 numactl --show
Set preferred node 1 and show the resulting state.
numactl --interleave=all --shmkeyfile /tmp/shmkey
Interleave all of the sysv shared memory regiion specified by /tmp/shmkey over all nodes.
numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch
Bind the second gigabyte in the tmpfs file /dev/shm/A to node 1.
numactl --localalloc /dev/shm/file
Reset the policy for the shared memory file file to the default localalloc policy.
Do you think “Run big database with its memory interleaved on all CPUs” or “Run process on node 0 with memory allocated on node 0 and 1” are arguments to the numactl command? No? Me neither. It sure looks like it in the examples section, though. Not very tidy.
It doesn’t really make me crabby…this is just blogging.
Hmm… I am not sure this is totally right. Most OSes have some sort of CPU affinity for threads or processes even without a NUMA scheduler (I guess mostly because of first-level caches, I don't know). So the scheduler does make the decision to schedule the next timeslice on another core, and that could be called “migration”.
Hi Bernd,
All modern schedulers try to maintain “soft affinity” of processes to processors in order to reduce processor cache thrashing. When the scheduler (executing on some processor) switches between processes and decides to execute a process that last ran on a different processor it does so based on code that hopes to balance cache refresh cost versus idle processor cost. Nonetheless, when a processor switches to a process that last executed elsewhere, the new process has not migrated anywhere.
It is true that migration is indeed a distributed computer scheduling concept whether NUMA, COMA or other…but not applicable to flat-memory SMPs.
Well, the Linux kernel uses different terminology than you do. See kernel/sched.c. There are multiple functions that call this migration; there is a migrate_task() function, and it is called not only in the NUMA case.
However, since there seem to be differing definitions of task migration, it might be a good idea to file a bug report against the manpage to make this clearer.
Greetings
Bernd
Bernd,
I know the Linux kernel monikers used in sched.c. It won't be until most systems are NUMA that they'll wonder why they ever chose terms that connote physical movement (e.g., migrate, pull, etc.) when dealing with the placement of process structs into different queues/lists and so on. Nothing in sched.c actually “moves” anything. Migrating a process on a NUMA system, on the other hand, moves the pages from one node to another. Big difference…trivial pursuit…until, that is, the majority of servers are NUMA and people see what process migration, remote execution and so forth actually do to their workload.
BTW: what does “serialize CPU with CPUID” mean?
Hi Bernd,
The CPUID instruction does not execute until all previously issued instructions have completed, so in-flight, pipelined work is drained before it runs. It is a way to ensure that what you are about to do happens in exactly the order you think it will. Now that I think about it, the pseudo code I put out there needs another CPUID before the second rdtsc().
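In other words, reusing the hypothetical rdtsc()/serialize() helpers from the sketch earlier in the post, the corrected sequence would be:

serialize();                 /* CPUID: wait for in-flight instructions to retire */
uint64_t t0 = rdtsc();

/* ... the code being timed ... */

serialize();                 /* the "other" CPUID, before the second read */
uint64_t t1 = rdtsc();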
Why would CLOCK_THREAD_CPUTIME_ID go backwards? It isn't a wall clock timer; CLOCK_REALTIME and CLOCK_MONOTONIC are. Ticks/nanoseconds/microseconds of CPU time consumed by a process or thread should never go backwards. My manpages don't match yours. They are even more vague. I think this is a bug in glibc or the Linux kernel.