In my blog “mini-series” about Oracle on Opteron NUMA, I am about to start covering the Linux 2.6 NUMA API and what it means to Oracle. I will certainly share a lot of statspack information, but first we need to start with micro-benchmark tests. The best micro-benchmark for analyzing memory latency is one that uses the fewest processor cycles to write memory that is most likely not in processor cache. That is, spend the fewest cycles to put the most load on the memory subsystem. To that end, I'd like to make available an SLB—a Silly Little Benchmark.
Introducing the Silly Little Benchmark
I took some old code of mine as the framework and stripped out all but the most simplistic code to make the points of memory latency and locality clear. Now, I'm not suggesting this SLB mimics Oracle at all. There are three significant sources of contention missing from this SLB; adding them would bring it closer to what Oracle does to a system:
- Shared Memory
- Mutual Exclusion (e.g., spinlocks in shared memory)
- I/O
It just so happens that the larger code I ripped this SLB from does possess these three characteristics; however, I want to get the simplest form of it out there first. As this NUMA series progresses I’ll make other pieces available. Also, this version of the SLB is quite portable—it should work on about any modern Unix variant.
Where to Download the SLB
The SLB kit is stored in a tar archive here (slb.tar).
Description of the SLB
It is supposed to be very simple and it is. It consists of four parts:
- create_sem: as simple as it sounds. It creates a single IPC semaphore in advance of running memhammer.
- memhammer: the Silly Little Benchmark driver. It takes two arguments (no options): the first is the number of 4KB pages to allocate and the second is the number of loop iterations to perform.
- trigger: all memhammer processes wait on the semaphore created by create_sem; trigger operates the semaphore to start a run.
- cpu_bind: binds a process to a CPU (a sketch of the idea follows this list).
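For those curious how such a binding is typically done on Linux, the following is a minimal sketch of a cpu_bind-style program using sched_setaffinity(2). It is only an illustration of the idea, not the code shipped in slb.tar.

/*
 * cpu_bind sketch (illustration only): bind an already-running process,
 * such as the invoking shell ($$), to a single CPU. Children forked after
 * the call inherit the affinity mask.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char *argv[])
{
        pid_t pid;
        int cpu;
        cpu_set_t mask;

        if (argc != 3) {
                fprintf(stderr, "usage: %s pid cpu\n", argv[0]);
                return 1;
        }
        pid = (pid_t)atoi(argv[1]);
        cpu = atoi(argv[2]);

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);

        if (sched_setaffinity(pid, sizeof(mask), &mask) != 0) {
                perror("sched_setaffinity");
                return 1;
        }
        return 0;
}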
The first action for running the SLB is to execute create_sem. Next, fire off any number of memhammer processes, up to the number of processors on the system. It makes no sense to run more memhammer processes than there are processors in the machine. Each memhammer uses malloc(3) to allocate some heap, initializes it all with memset(3), and then waits on the semaphore created by create_sem. Next, execute trigger and the SLB will commence its work loop, which steps through pseudo-random 4KB offsets in the malloc'ed memory and writes an 8-byte location within the first 64 bytes of each 4KB page it lands on. Why 64 bytes? All the 64-bit systems I know of manage physical memory using a 64-byte cache line. As long as we write any location residing entirely within a 64-byte line, we cause as much work for the memory subsystem as we would by scribbling on each of the eight 8-byte words the line can hold. Not scribbling over the entire line relieves us of the CPU overhead and allows us to put more duress on the memory subsystem—and that is the goal. SLB has a very small measured unit of work, but it causes maximum memory stalls. Well, not maximum (that would require spinlock contention), but this is good enough for this point of the NUMA mini-series.
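To make that description concrete, here is a rough sketch of the memhammer idea. This is not the shipped source; the semaphore key and the random-number details are assumptions, but the shape of the loop (pick a pseudo-random 4KB page, then write one 8-byte word somewhere in its first 64 bytes) is the point.

/*
 * memhammer sketch (illustration only).
 * argv[1] = number of 4KB pages to allocate, argv[2] = loop iterations.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define PAGESZ 4096

int main(int argc, char *argv[])
{
        long pages, iters, i, j;
        char *buf;
        int semid;
        struct sembuf waitop = { 0, -1, 0 };
        unsigned int seed = (unsigned int)getpid();

        if (argc != 3) {
                fprintf(stderr, "usage: %s pages iterations\n", argv[0]);
                return 1;
        }
        pages = atol(argv[1]);
        iters = atol(argv[2]);

        buf = malloc(pages * PAGESZ);
        memset(buf, 0, pages * PAGESZ);   /* touch every page up front */

        /* Wait on the semaphore created by create_sem; running trigger
         * releases the waiters. The key and the exact semaphore arithmetic
         * here are assumptions for the sake of illustration. */
        semid = semget((key_t)0x534c42, 1, 0666);
        if (semid != -1)
                semop(semid, &waitop, 1);

        for (i = 0; i < iters; i++) {
                for (j = 0; j < pages; j++) {
                        long page = rand_r(&seed) % pages;    /* pseudo-random 4KB page */
                        long slot = (rand_r(&seed) % 8) * 8;  /* 8-byte word, first line */
                        *(long *)(buf + (page * PAGESZ) + slot) = i;
                }
        }
        return 0;
}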
Measuring Time
In prior lives, all of this sort of low-level measuring was performed with x86 assembly that reads the processor cycle counter—RDTSC. However, I've found it to be very inconsistent on multi-core AMD processors no matter how much fiddling I do to serialize the processor (e.g., with the CPUID instruction). It could just be me, I don't know. It turns out that it is difficult to stop speculative reading of the TSC, and I don't have time to fiddle with it on a pre-Socket F Opteron. When I finally get my hands on a Socket F Opteron system, I'll change my measurement technique to RDTSCP, which serializes and reads the time stamp counter correctly. Until then, I think performing millions upon millions of operations and dividing the elapsed time from microsecond-resolution gettimeofday(2) by the operation count should be about sufficient. Any trip through the work loop that gets nailed by a hardware interrupt will unfortunately increase the average, but running the SLB on an otherwise idle system should be a pretty clean test.
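Here is a minimal sketch of that timing approach, assuming a work loop like the one sketched above. The printf mirrors the output format shown in the runs below.

/*
 * Timing sketch (illustration only): bracket the work loop with
 * gettimeofday(2) and derive average nanoseconds per operation.
 */
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
        struct timeval t0, t1;
        long long total_ops = 262144LL * 6000LL;  /* pages * iterations */
        long long usec;

        gettimeofday(&t0, NULL);
        /* ... the memhammer work loop would run here ... */
        gettimeofday(&t1, NULL);

        usec = (t1.tv_sec - t0.tv_sec) * 1000000LL + (t1.tv_usec - t0.tv_usec);
        if (usec < 1)
                usec = 1;   /* guard; the work loop above is only a placeholder */

        printf("Total ops %lld Avg nsec/op %.1f gettimeofday usec %lld TPUT ops/sec %.1f\n",
               total_ops,
               (usec * 1000.0) / (double)total_ops,
               usec,
               (double)total_ops / (usec / 1000000.0));
        return 0;
}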
Example Measurements
Getting ready for the SLB is quite simple. Just extract and compile:
$ ls -l slb.tar
-rw-r--r-- 1 root root 20480 Jan 26 10:12 slb.tar
$ tar xvf slb.tar
cpu_bind.c
create_sem.c
Makefile
memhammer.c
trigger.c
$ make
cc -c -o memhammer.o memhammer.c
cc -O -o memhammer memhammer.o
cc -c -o trigger.o trigger.c
cc -O -o trigger trigger.o
cc -c -o create_sem.o create_sem.c
cc -O -o create_sem create_sem.o
cc -c -o cpu_bind.o cpu_bind.c
cc -O -o cpu_bind cpu_bind.o
Some Quick Measurements
I used my DL585 with 4 dual-core Opteron 850s to test a single invocation of memhammer and then compared it to two invocations on the same socket. The first test bound execution to processor number 7, which completed the run in 108.28 seconds with an average write latency of 68.8ns. The next test ran two invocations, both on the same physical CPU, which made the result a bit "lumpy." The average of the two was 70.2ns—about 2% more than the single invocation on the same processor. For what it is worth, there was a 2.4% variation in average latency between the two concurrent invocations:
$ cat example1
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./trigger
wait
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 6
./memhammer 262144 6000 &
./trigger
wait
$ sh ./example1
Total ops 1572864000 Avg nsec/op 68.8 gettimeofday usec 108281130 TPUT ops/sec 14525744.2
Total ops 1572864000 Avg nsec/op 69.3 gettimeofday usec 108994268 TPUT ops/sec 14430703.8
Total ops 1572864000 Avg nsec/op 71.0 gettimeofday usec 111633529 TPUT ops/sec 14089530.4
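To connect the output to the arguments: each memhammer was given 262144 pages and 6000 iterations, so it performs 262144 x 6000 = 1,572,864,000 writes. The first run took 108,281,130 microseconds, and 108,281,130,000 nanoseconds divided by 1,572,864,000 operations yields the reported 68.8ns per write, or roughly 14.5 million writes per second.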
The next test was to dedicate an entire socket to each of two concurrent invocations which really smoothed out the result. Executing this way resulted in less than 1/10th of 1 percent variance between the average write latencies:
$ cat example2
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./trigger
wait
$ sh ./example2
Total ops 1572864000 Avg nsec/op 69.1 gettimeofday usec 108640712 TPUT ops/sec 14477666.5
Total ops 1572864000 Avg nsec/op 69.2 gettimeofday usec 108871507 TPUT ops/sec 14446975.6
Cool And Quiet
Up to this point, the testing was done with the DL585s executing at 2200MHz. Since I have my DL585s set for dynamic power consumption adjustment, I can simply blast a SIGUSR2 at the cpuspeed processes. The cpuspeed processes catch the SIGUSR2 and adjust the P-state of the processor to the lowest power consumption—a feature supported by AMD Cool And Quiet technology. The following shows how to determine the range of clock speeds the processor can execute at. In my case, the processor ranges from 2200MHz down to 1800MHz. Note that I recommend fixing the clock speed with SIGUSR1 or SIGUSR2 before any performance testing; otherwise you might grow tired of inconsistent results. Also note that there is no man page for the cpuspeed executable; you have to execute it to get a help message with command usage.
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
2200000 2000000 1800000
# chkconfig --list cpuspeed
cpuspeed 0:off 1:on 2:on 3:on 4:on 5:on 6:on
# ps -ef | grep cpuspeed | grep -v grep
root 1796 1 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1797 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1798 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1799 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1800 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1801 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1802 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1803 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
# cpuspeed -h 2>&1 | tail -12
To have a CPU stay at the highest clock speed to maximize performance send the process controlling that CPU the SIGUSR1 signal.
To have a CPU stay at the lowest clock speed to maximize battery life send the process controlling that CPU the SIGUSR2 signal.
To resume having a CPU’s clock speed dynamically scaled send the process controlling that CPU the SIGHUP signal.
So, now that the clock speed has been adjusted down 18% from 2200MHz to 1800MHz, let’s see what example1 does:
$ sh ./example1
Total ops 1572864000 Avg nsec/op 81.6 gettimeofday usec 128352437 TPUT ops/sec 12254259.0
Total ops 1572864000 Avg nsec/op 77.0 gettimeofday usec 121043918 TPUT ops/sec 12994159.7
Total ops 1572864000 Avg nsec/op 82.8 gettimeofday usec 130198851 TPUT ops/sec 12080475.3
The slower clock rate brought the single-invocation average up to 81.6ns—roughly 18% higher, tracking the 18% reduction in clock speed.
With the next blog entry, I’ll start to use the SLB to point out NUMA characteristics of 4-way Opteron servers. After that it will be time for some real Oracle numbers. Please stay tuned.
Non-Linux Platforms
My old buddy Glenn Fawcett of Sun’s Strategic Applications Engineering Group has collected data from both SPARC and Opteron-based Sun servers. He said to use the following compiler options:
- -xarch=v9 … for sparc
- -xarch=amd64 for opteron
He and I had Martinis the other night, but I forgot to ask him to run it on idle systems for real measurement purposes. Having said that, I'd really love to see numbers from anyone who cares to run this. It would be great to have the output of /proc/cpuinfo and the server make/model. In fact, if someone can run this on a Socket F Opteron system I'd greatly appreciate it. It seems our Socket F systems are not slated to arrive here for a few more weeks.