In my recent blog post entitled Oracle on Opteron with Linux-The NUMA Angle (Part III). Introducing the Silly Little Benchmark, I made the SLB available and hoped some folks would measure other systems using the kit. Well, I got my first results back from a fellow member of the OakTable Network, Christian Antognini of Trivadis AG. I finally got to meet him face to face back in November 2006 at UKOUG.
Christian was nice enough to run it on a brand new Dell PE1900 with, yes, a quad-core “Clovertown” processor of the low-end E5320 variety. As packaged, this Clovertown-based system has a 1066MHz front side bus and the memory is configured with 4x1GB 667MHz DIMMs. The processor was clocked at 1.86GHz.
Here is a snippet of /proc/cpuinfo from Christian’s system:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5320 @ 1.86GHz
stepping : 7
cpu MHz : 1862.560
I asked Christian to run the SLB (memhammer) with 1, 2 and 4 threads of execution and to limit the amount of memory per process to 512MB. The script below makes three passes: a single memhammer bound to processor 3, then two concurrent memhammer processes bound to processors 3 and 1, then four, one on each of processors 0 through 3. He submitted the following:
[cha@helicon slb]$ cat example4.sh
./cpu_bind $$ 3
./create_sem
./memhammer 131072 6000 &
./trigger
wait
./cpu_bind $$ 3
./create_sem
./memhammer 131072 6000 &
./cpu_bind $$ 1
./memhammer 131072 6000 &
./trigger
wait
./cpu_bind $$ 3
./create_sem
./memhammer 131072 6000 &
./cpu_bind $$ 2
./memhammer 131072 6000 &
./cpu_bind $$ 1
./memhammer 131072 6000 &
./cpu_bind $$ 0
./memhammer 131072 6000 &
./trigger
wait
[cha@helicon slb]$ ./example4.sh
Total ops 786432000 Avg nsec/op 131.3 gettimeofday usec 103240338 TPUT ops/sec 7617487.7
Total ops 786432000 Avg nsec/op 250.4 gettimeofday usec 196953024 TPUT ops/sec 3992992.8
Total ops 786432000 Avg nsec/op 250.7 gettimeofday usec 197121780 TPUT ops/sec 3989574.4
Total ops 786432000 Avg nsec/op 503.6 gettimeofday usec 396010106 TPUT ops/sec 1985888.7
Total ops 786432000 Avg nsec/op 503.6 gettimeofday usec 396023560 TPUT ops/sec 1985821.2
Total ops 786432000 Avg nsec/op 504.6 gettimeofday usec 396854086 TPUT ops/sec 1981665.4
Total ops 786432000 Avg nsec/op 505.1 gettimeofday usec 397221522 TPUT ops/sec 1979832.3
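For anyone trying to interpret the output, the fields are internally consistent: each memhammer invocation above performs 131072 x 6000 = 786,432,000 operations, and the latency and throughput columns follow directly from the reported gettimeofday elapsed time. Here is a small sanity-check program (my own arithmetic, not part of the SLB kit) that recomputes the figures from the first line of the single-process run:

/* Recompute the memhammer output fields from the raw numbers (my own
 * arithmetic, not SLB source): total ops = pages * iterations, average
 * latency = elapsed usec * 1000 / ops, throughput = ops / elapsed seconds. */
#include <stdio.h>

int main(void)
{
    long long pages = 131072;          /* first memhammer argument */
    long long iters = 6000;            /* second memhammer argument */
    long long ops   = pages * iters;   /* 786,432,000 -> matches "Total ops" */
    double usec     = 103240338.0;     /* reported gettimeofday usec */

    printf("ops     = %lld\n", ops);
    printf("nsec/op = %.1f\n", usec * 1000.0 / ops);  /* ~131.3 */
    printf("ops/sec = %.1f\n", ops / (usec / 1e6));   /* ~7,617,487.7 */
    return 0;
}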
So, while this is not a flagship Clovertown Xeon (e.g., E5355), the latencies are pretty scary. Contrast these results with the DL585 Opteron 850 numbers I shared in this blog entry. The Opteron 850 delivers 69ns with 2 concurrent threads of execution, some 47% quicker than this system exhibits with only 1 memhammer process running, and in the direct comparison of 2 concurrent memhammer processes the E5320 is an astounding 3.6x slower than the Opteron 850 box. Here we see the true benefit of an on-die memory controller and the fact that Hypertransport is a true 1GHz path to memory. With 4 concurrent memhammer processes, the E5320 bogged down to roughly 500ns! I’ll blog soon about what I see with the SLB on 4 sockets with my DL585 in the continuing NUMA series.
Other Numbers
I’d sure like to get numbers from others. How about Linux on Itanium? How about a Power system with AIX? How about some SPARC numbers? Anyone have a Socket F Opteron box they could collect SLB numbers on? If so, get the kit and run the same scripts Christian did on his Dell PE1900. Thanks, Christian.
A Note About The SLB
Be sure to limit the memory allocation such that it does not cause major page faults or, eek, swapping. The first argument to memhammer is the number of 4KB pages to allocate; 131072 pages x 4KB = 512MB, which is why Christian’s script passes 131072.
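To size that first argument for a different footprint, just divide the target allocation by the 4KB page size. Here is a trivial helper (hypothetical, not part of the SLB kit) that prints the memhammer invocation for a given number of megabytes:

/* Hypothetical helper, not part of the SLB kit: print the memhammer
 * page-count argument for a target per-process allocation in megabytes,
 * assuming 4KB pages. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    long mb = (argc > 1) ? atol(argv[1]) : 512;   /* default to 512MB */
    long pages = (mb * 1024L * 1024L) / 4096L;    /* 512MB -> 131072 pages */
    printf("./memhammer %ld 6000\n", pages);
    return 0;
}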
Awesome stuff! Unreal, the change from 3 to 4 memhammer!
I’m dying to get my hands on some AIX gear in my new job: will definitely give this a try. Please do keep the kit available for a while.
You got it, Noons… of course you can download it at any time… I’m trying to get my hands on a System x 3950 right now… I’ve still got contacts at IBM 🙂
Kevin,
Thanks for your blog. The stuff you are covering is where I really feel ignorant and this is a great resource (and one of the few I have found). I am confused about one thing with your SLB test. Its purpose is to analyze memory latency. For each test you document the CPU type (e.g., Opteron, Clovertown). What is the connection between processor and memory latency? How does this differ on single, dual and quad core?
Thanks.
Hi Henry,
Memory latency is nothing more than the time it takes to load or store a memory location. Latency as a concept doesn’t differ based on core count. Different processors will satisfy their cores with varying memory latencies as per their architecture.
So is all RAM basically the same and the memory latency is a function of how the different processors satisfy their cores?
@Henry:
No, not at all. Allow me to try and fill in for Kevin:
Memory latency is a result of the architecture and technology used to make the actual memory modules and, as a consequence, of the physical connectivity of the memory subsystem built around such modules.
This is fixed for a particular type or technology of memory module.
This is what Kevin means when he says: “Memory latency is nothing more than the time it takes to load or store a memory location”.
As in: the time it takes for the memory subsystem – be that a chip or a set of chips or whatever – to store or retrieve a given memory location.
Totally independent of the architecture of the “core” processor(s). Hopefully.
Processors may have one technology or architecture to access their native L1 and L2 cache memory – “native” as in residing in the same “chip” and allowing their clock rates to go at top speed – and yet use a totally different architecture and/or technology when embedded in a practical, operational system, to access that system’s main memory, at a much slower rate than the native cache.
It’s the potential mismatch – also called “impedance mismatch” in electrical engineering parlance – between the two main “classes” of memory access and their relative speeds and synchronisation that becomes interesting. And causes the “cache stalls” Kevin talks about. And how well a given technology addresses this mismatch.
Kevin, please help extricate the foot off my mouth if that’s the case.
Thanks Noons, I think I’ve got it now. The memory latency depends on the memory subsystem (though you can’t measure it without having a processor in the loop, so no measurement is completely independent of CPU). One would expect this latency to remain the same regardless of how many processors are chugging away. This is not true because of the “impedance mismatch”. A good experiment should list both the memory and CPU. Also, seeing similar behavior (i.e. cache stalls) with one type of CPU running on various types of memory will give some insight into CPU behavior. Is that about right?
Henry
Henry,
The number of CPUs doesn’t affect latency as long as bus bandwidth does not get saturated and (MOST IMPORTANTLY) they aren’t constantly hammering the same memory line, as is the case with spinlocks. Processors stall for long periods when pounding on heavily contended memory holding spinlocks. This is why so much work goes into breaking work out into multiple locks and why alternatives to spinlocks are so attractive (e.g., queued locks, reader-writer locks, RCU, etc.).
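To make the “multiple locks” point concrete, here is a minimal sketch (my illustration, not code from Oracle, Sequent or the SLB) of hashing a resource onto one of several cache-line-padded spinlocks, so that CPUs contending for different resources are not invalidating the same memory line:

/* Minimal sketch of lock splitting: hash each resource onto one of several
 * cache-line-padded spinlocks instead of one global lock, so CPUs working on
 * different resources do not ping-pong the same cache line.
 * Illustrative only; NBUCKETS and all names are made up for this example. */
#include <stdatomic.h>
#include <stddef.h>

#define NBUCKETS   16
#define CACHE_LINE 64

struct padded_lock {
    atomic_int locked;                           /* 0 = free, 1 = held */
    char pad[CACHE_LINE - sizeof(atomic_int)];   /* one lock per cache line */
};

static struct padded_lock bucket_lock[NBUCKETS]; /* zero-initialized: all free */

static void lock_bucket(size_t key)
{
    atomic_int *l = &bucket_lock[key % NBUCKETS].locked;
    int expected = 0;
    /* Spin until this bucket's lock is acquired; only threads hashing to the
     * same bucket contend on the same line. */
    while (!atomic_compare_exchange_weak_explicit(l, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0;  /* CAS overwrites expected on failure; reset and retry */
}

static void unlock_bucket(size_t key)
{
    atomic_store_explicit(&bucket_lock[key % NBUCKETS].locked, 0,
                          memory_order_release);
}

The trade-off is memory: each lock burns a full cache line, which is exactly the point, since sharing a line is what causes the stalls.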
Noons used the term “impedance mismatch” which is fine, but in the case of Opterons, the Hypertransport is clocked at the same rate as the processor. Very elegant stuff. But, it’s NUMA and that is why I’m blogging this thread.
Just for the sake of flashback, I recall running heavily contentious Oracle workloads with bus analyzers attached back in the Sequent days. The Orion chipset that supported the Pentium Pro processor routinely stalled for 19 bus cycles (yes bus cycles at 90MHz) when invalidating heavily contended cache lines (e.g., releasing a lock). Yes, releasing a lock is expensive. At least when there are other “interested parties”.
Trivial pursuit…