In my blog “mini-series” about Oracle on Opteron NUMA, I am about to start covering the Linux 2.6 NUMA API and what it means to Oracle. I will certainly share a lot of statspack information, but first we need some micro-benchmark tests. The best micro-benchmark for analyzing memory latency is one that uses the fewest processor cycles to write memory that is most likely not in processor cache; that is, spend the fewest cycles to put the most load on the memory subsystem. To that end, I’d like to make available an SLB, a Silly Little Benchmark.
Introducing the Silly Little Benchmark
I took some old code of mine as the framework and stripped out all but the most simplistic code to make the points of memory latency and locality clear. Now, I’m not suggesting this SLB mimics Oracle at all. There are three significant sources of contention missing from this SLB that would bring it closer to what Oracle does to a system:
- Shared Memory
- Mutual Exclusion (e.g., spinlocks in shared memory)
- I/O
It just so happens that the larger code I ripped this SLB from does possess these three characteristics; however, I want to get the simplest form of it out there first. As this NUMA series progresses I’ll make other pieces available. Also, this version of the SLB is quite portable; it should work on just about any modern Unix variant.
Where to Download the SLB
The SLB kit is stored in a tar archive here (slb.tar).
Description of the SLB
It is supposed to be very simple and it is. It consists of four parts:
- create_sem: as simple as it sounds. It creates a single IPC semaphore in advance of memhammer.
- memhammer: the Silly Little Benchmark driver. It takes two arguments (without options): the first is the number of 4KB pages to allocate and the second is the number of loop iterations to perform.
- trigger: all memhammer processes wait on the semaphore created by create_sem; trigger operates that semaphore to start a run.
- cpu_bind: binds a process to a CPU (a minimal sketch follows this list).
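For readers curious about what cpu_bind does under the covers, here is a minimal sketch using the Linux sched_setaffinity(2) call. The cpu_bind.c shipped in slb.tar is the authoritative version and may differ (a truly portable build would need a different binding call on non-Linux systems). The key point is that binding the shell’s own PID ($$) means everything the shell subsequently forks, including memhammer, inherits the CPU binding.

/* Hypothetical sketch of cpu_bind: bind an existing process (by PID) to one CPU.
 * The real cpu_bind.c in slb.tar may be implemented differently. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char *argv[])
{
        pid_t pid;
        int cpu;
        cpu_set_t mask;

        if (argc != 3) {
                fprintf(stderr, "usage: %s pid cpu\n", argv[0]);
                exit(1);
        }
        pid = (pid_t)atoi(argv[1]);
        cpu = atoi(argv[2]);

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);

        /* Children forked after this call inherit the affinity mask. */
        if (sched_setaffinity(pid, sizeof(mask), &mask) != 0) {
                perror("sched_setaffinity");
                exit(1);
        }
        return 0;
}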
The first step in running the SLB is to execute create_sem. Next, fire off any number of memhammer processes up to the number of processors on the system; it makes no sense to run more memhammer processes than there are processors in the machine. Each memhammer uses malloc(3) to allocate some heap, initializes it all with memset(3), and then waits on the semaphore created by create_sem. Next, execute trigger and the SLB commences its work loop, which iterates through pseudo-random 4KB offsets in the malloc’ed memory and writes an 8-byte location within the first 64 bytes of each page. Why 64 bytes? All the 64-bit systems I know of manage physical memory using a 64-byte cache line. As long as we write to any location residing entirely within a 64-byte line, we cause as much work for the memory subsystem as we would by scribbling on each of the eight 8-byte words the line can hold. Not scribbling over the entire line relieves us of the CPU overhead and lets us put more duress on the memory subsystem, and that is the goal. The SLB has a very small measured unit of work, but it causes maximum memory stalls. Well, not maximum, since that would require spinlock contention, but it is good enough for this point of the NUMA mini-series. A minimal sketch of the work loop follows.
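The following is only an illustrative sketch of that work loop, not the memhammer.c from the kit; the semaphore key and the wait-for-zero handshake are assumptions made for the sketch, and the real code may differ in detail. With the arguments used later in this post (262144 pages, 6000 iterations), the buffer is 1GB and the loop performs 262144 x 6000 = 1,572,864,000 writes.

/* Minimal sketch of the memhammer work loop (not the memhammer.c from slb.tar).
 * The semaphore key and handshake here are assumptions for illustration only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define PAGE 4096
#define SEM_KEY 0xBEEF          /* made-up key; create_sem defines the real one */

int main(int argc, char *argv[])
{
        long pages, iters, i, j;
        unsigned int seed = 1;
        char *buf;
        int semid;
        struct sembuf wait_for_zero = { 0, 0, 0 };

        if (argc != 3) {
                fprintf(stderr, "usage: %s pages iterations\n", argv[0]);
                return 1;
        }
        pages = atol(argv[1]);
        iters = atol(argv[2]);

        buf = malloc(pages * PAGE);
        memset(buf, 0, pages * PAGE);   /* fault in every page up front */

        /* Block until trigger drives the semaphore to zero so that all
         * concurrent memhammer processes start at the same instant. */
        semid = semget(SEM_KEY, 1, 0666);
        semop(semid, &wait_for_zero, 1);

        for (i = 0; i < iters; i++) {
                for (j = 0; j < pages; j++) {
                        /* Pick a pseudo-random 4KB page and scribble on one
                         * 8-byte word inside its first 64 bytes: a full cache
                         * line of memory traffic for almost no CPU work. */
                        long page = rand_r(&seed) % pages;
                        *(long *)(buf + page * PAGE) = i ^ j;
                }
        }
        return 0;
}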
Measuring Time
In prior lives, all of this sort of low-level measurement was performed with x86 assembly that reads the processor cycle counter, RDTSC. However, I’ve found it to be very inconsistent on multi-core AMD processors no matter how much fiddling I do to serialize the processor (e.g., with the CPUID instruction). It could just be me, I don’t know. It turns out it is difficult to stop speculative reads of the TSC, and I don’t have time to fiddle with it on a pre-Socket F Opteron. When I finally get my hands on a Socket F Opteron system, I’ll change my measurement technique to RDTSCP, which reads the time stamp counter only after prior instructions have completed. Until then, performing millions upon millions of operations and dividing the elapsed time from microsecond-resolution gettimeofday(2) by the operation count should be about sufficient. Any trip through the work loop that gets nailed by a hardware interrupt will unfortunately inflate the average, but running the SLB on an otherwise idle system should make for a pretty clean test.
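As a quick sanity check of that arithmetic: the reported average is just elapsed gettimeofday(2) microseconds times 1000, divided by the total operation count. For the first run shown below, 108,281,130 usec over 1,572,864,000 writes works out to 68.8ns per write. Here is a hedged sketch of the timing wrapper; the variable names are mine, not those in memhammer.c.

/* Sketch of the timing approach: bracket the entire work loop with
 * gettimeofday(2) and divide elapsed microseconds by the operation count.
 * Names are illustrative, not lifted from memhammer.c. */
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
        struct timeval t0, t1;
        long total_ops = 262144L * 6000L;    /* pages * iterations */
        double usec, nsec_per_op, ops_per_sec;

        gettimeofday(&t0, NULL);
        /* ... the memhammer work loop would run here ... */
        gettimeofday(&t1, NULL);

        usec = (t1.tv_sec - t0.tv_sec) * 1000000.0 + (t1.tv_usec - t0.tv_usec);
        nsec_per_op = usec * 1000.0 / total_ops;
        ops_per_sec = total_ops / (usec / 1000000.0);

        printf("Total ops %ld Avg nsec/op %.1f gettimeofday usec %.0f TPUT ops/sec %.1f\n",
               total_ops, nsec_per_op, usec, ops_per_sec);
        return 0;
}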
Example Measurements
Getting ready for the SLB is quite simple: just extract and compile:
$ ls -l slb.tar
-rw-r--r-- 1 root root 20480 Jan 26 10:12 slb.tar
$ tar xvf slb.tar
cpu_bind.c
create_sem.c
Makefile
memhammer.c
trigger.c
$ make
cc -c -o memhammer.o memhammer.c
cc -O -o memhammer memhammer.o
cc -c -o trigger.o trigger.c
cc -O -o trigger trigger.o
cc -c -o create_sem.o create_sem.c
cc -O -o create_sem create_sem.o
cc -c -o cpu_bind.o cpu_bind.c
cc -O -o cpu_bind cpu_bind.o
Some Quick Measurements
I used my DL585 with 4 dual-core Opteron 850s to test a single invocation of memhammer and then compared it to 2 invocations on the same socket. The first test bound execution to processor number 7, which completed the run in 108.28 seconds with an average write latency of 68.8ns. The next test ran 2 invocations, both on the same physical CPU, which made the result a bit “lumpy.” The average of the two was 70.2ns, about 2% more than the single invocation on the same processor. For what it is worth, there was a 2.4% average latency variation between the two concurrent invocations:
$ cat example1
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./trigger
wait
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 6
./memhammer 262144 6000 &
./trigger
wait
$ sh ./example1
Total ops 1572864000 Avg nsec/op 68.8 gettimeofday usec 108281130 TPUT ops/sec 14525744.2
Total ops 1572864000 Avg nsec/op 69.3 gettimeofday usec 108994268 TPUT ops/sec 14430703.8
Total ops 1572864000 Avg nsec/op 71.0 gettimeofday usec 111633529 TPUT ops/sec 14089530.4
The next test dedicated an entire socket to each of two concurrent invocations, which really smoothed out the result. Executing this way resulted in roughly one tenth of one percent variance between the average write latencies:
$ cat example2
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./trigger
wait
$ sh ./example2
Total ops 1572864000 Avg nsec/op 69.1 gettimeofday usec 108640712 TPUT ops/sec 14477666.5
Total ops 1572864000 Avg nsec/op 69.2 gettimeofday usec 108871507 TPUT ops/sec 14446975.6
Cool And Quiet
Up to this point, the testing was done with the DL585 executing at 2200MHz. Since I have my DL585s set for dynamic power consumption adjustment, I can simply blast a SIGUSR2 at the cpuspeed processes. The cpuspeed processes catch the SIGUSR2 and adjust the P-state of the processor to its lowest power consumption, a feature supported by AMD Cool And Quiet technology. The following shows how to determine what frequency bounds the processor can execute at; in my case, the processor ranges from 2200MHz down to 1800MHz. Note that I recommend fixing the clock speed with SIGUSR1 or SIGUSR2 before any performance testing, or you might grow tired of inconsistent results. Note also that there is no man page for the cpuspeed executable; you have to execute it to get a help message with command usage.
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
2200000 2000000 1800000
# chkconfig --list cpuspeed
cpuspeed 0:off 1:on 2:on 3:on 4:on 5:on 6:on
# ps -ef | grep cpuspeed | grep -v grep
root 1796 1 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1797 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1798 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1799 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1800 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1801 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1802 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1803 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
# cpuspeed -h 2>&1 | tail -12
To have a CPU stay at the highest clock speed to maximize performance send the process controlling that CPU the SIGUSR1 signal.
To have a CPU stay at the lowest clock speed to maximize battery life send the process controlling that CPU the SIGUSR2 signal.
To resume having a CPU’s clock speed dynamically scaled send the process controlling that CPU the SIGHUP signal.
So, now that the clock speed has been adjusted down 18% from 2200MHz to 1800MHz, let’s see what example1 does:
$ sh ./example1
Total ops 1572864000 Avg nsec/op 81.6 gettimeofday usec 128352437 TPUT ops/sec 12254259.0
Total ops 1572864000 Avg nsec/op 77.0 gettimeofday usec 121043918 TPUT ops/sec 12994159.7
Total ops 1572864000 Avg nsec/op 82.8 gettimeofday usec 130198851 TPUT ops/sec 12080475.3
The slower clock rate brought the single-invocation average up to 81.6ns, an increase of roughly 18% that matches the drop in clock speed.
With the next blog entry, I’ll start to use the SLB to point out NUMA characteristics of 4-way Opteron servers. After that it will be time for some real Oracle numbers. Please stay tuned.
Non-Linux Platforms
My old buddy Glenn Fawcett of Sun’s Strategic Applications Engineering Group has collected data from both SPARC and Opteron-based Sun servers. He said to use the following compiler options:
- -xarch=v9 for SPARC
- -xarch=amd64 for Opteron
He and I had Martinis the other night, but I forgot to ask him to run it on idle systems for real measurement purposes. Having said that, I’d really love to see numbers from anyone who cares to run this. It would be great to have the output of /proc/cpuinfo and the server make/model. In fact, if someone can run this on a Socket F Opteron system I’d greatly appreciate it; it seems our Socket F systems are not slated to arrive here for a few more weeks.
Here are my results; I am curious what speed and OS your test system above was. This was run on an HP DL585 with 4 x dual-core AMD Opteron 8224 SE processors, running Red Hat Enterprise Linux AS release 4 (Nahant Update 6): Linux 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux.
I have turned CPUspeed off for all run levels.
----------------------------------------------
tail of /proc/cpuinfo
processor : 7
vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 8224 SE
stepping : 3
cpu MHz : 3215.171
cache size : 1024 KB
physical id : 3
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext lm 3dnowext 3dnow pni cx16
bogomips : 6429.76
TLB size : 1088 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp [4] [5]
-------------------------------------------------------------------
The system has the following CPUs
cpu MHz : 3215.165
cpu MHz : 3215.165
cpu MHz : 3215.165
cpu MHz : 3215.165
cpu MHz : 3215.165
cpu MHz : 3215.165
cpu MHz : 3215.165
cpu MHz : 3215.165
The load average is
11:01:50 up 8 days, 21:05, 2 users, load average: 1.12, 1.04, 1.01
Test1
Total ops 1572864000 Avg nsec/op 71.3 gettimeofday usec 112200786 TPUT ops/sec 14018297.5
Total ops 1572864000 Avg nsec/op 70.6 gettimeofday usec 111006651 TPUT ops/sec 14169096.9
Total ops 1572864000 Avg nsec/op 71.3 gettimeofday usec 112184608 TPUT ops/sec 14020319.1
Test2
Total ops 1572864000 Avg nsec/op 69.9 gettimeofday usec 109999747 TPUT ops/sec 14298796.5
Total ops 1572864000 Avg nsec/op 71.3 gettimeofday usec 112177062 TPUT ops/sec 14021262.2
Hi Kevin,
That was over a year ago, but I do recall it was a DL585 and I believe the CPUs were 8820. The OS was indeed RHEL4, but U3.
OK, my home machine is Socket F based: twin dual-core AMD Opteron 2218 CPUs at 2.6GHz each.
As detailed here:
http://www.oramoss.com/blog/2008/07/pc-for-manly-men.html
CPU Info:
[jeff@pedro closson_slb]$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 2218
stepping : 3
cpu MHz : 1000.000
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips : 2000.06
clflush size : 64
power management: ts fid vid ttp tm stc
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 2218
stepping : 3
cpu MHz : 1000.000
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips : 2000.06
clflush size : 64
power management: ts fid vid ttp tm stc
processor : 2
vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 2218
stepping : 3
cpu MHz : 1000.000
cache size : 1024 KB
physical id : 1
siblings : 2
core id : 0
cpu cores : 2
apicid : 2
initial apicid : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips : 2000.06
clflush size : 64
power management: ts fid vid ttp tm stc
processor : 3
vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 2218
stepping : 3
cpu MHz : 1000.000
cache size : 1024 KB
physical id : 1
siblings : 2
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips : 2000.06
clflush size : 64
power management: ts fid vid ttp tm stc
Other useful information:
[jeff@pedro closson_slb]$ uname -a
Linux pedro 2.6.27.21-170.2.56.fc10.i686.PAE #1 SMP Mon Mar 23 23:24:26 EDT 2009 i686 athlon i386 GNU/Linux
64 bit Fedora 10 installation.
[jeff@pedro closson_slb]$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
2600000 2400000 2200000 2000000 1800000 1000000
[jeff@pedro closson_slb]$ chkconfig --list cpuspeed
cpuspeed 0:off 1:on 2:on 3:on 4:on 5:on 6:off
Using adaptations of your scripts, without any changes to CPU speed:
[jeff@pedro closson_slb]$ cat example1
./cpu_bind $$ 3
./create_sem
./memhammer 262144 6000 &
./trigger
wait
./cpu_bind $$ 3
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 2
./memhammer 262144 6000 &
./trigger
wait
sh ./example1
Total ops 1572864000 Avg nsec/op 75.3 gettimeofday usec 118407731 TPUT ops/sec 13283457.0
Total ops 1572864000 Avg nsec/op 79.9 gettimeofday usec 125714674 TPUT ops/sec 12511379.5
Total ops 1572864000 Avg nsec/op 80.1 gettimeofday usec 126037316 TPUT ops/sec 12479351.8
[jeff@pedro closson_slb]$ cat example2
./cpu_bind $$ 3
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 1
./memhammer 262144 6000 &
./trigger
wait
sh ./example2
Total ops 1572864000 Avg nsec/op 77.1 gettimeofday usec 121257156 TPUT ops/sec 12971308.7
Total ops 1572864000 Avg nsec/op 111.1 gettimeofday usec 174757125 TPUT ops/sec 9000285.4
Anything else you want running, just ask.
Cheers
Jeff
Ooops…I installed the wrong OS didn’t I!
That would be Fedora 10 32 Bit installation on my 64 bit machine…doh! Read the uname -a results next time Jeff!
That will teach me to trust the (incorrect) label on the DVD!
I’m rebuilding the box with x86_64 FC10 and then I’ll repost the results to see what difference that makes…at least we’ll have a comparison!
More to follow…
Cheers
Jeff
It shouldn’t make that much difference, actually.
Right, try again with x86_64 version of FC10:
[jeff@pedro network-scripts]$ uname -a
Linux pedro 2.6.27.5-117.fc10.x86_64 #1 SMP Tue Nov 18 11:58:53 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
(C’mon, I had to make sure this time!)
Same tests and scripts…as you predicted, not much difference in results between 32 bit and 64 bit FC10:
[jeff@pedro closson_slb]$ sh ./example1
Total ops 1572864000 Avg nsec/op 76.0 gettimeofday usec 119582344 TPUT ops/sec 13152978.5
Total ops 1572864000 Avg nsec/op 79.9 gettimeofday usec 125654959 TPUT ops/sec 12517325.3
Total ops 1572864000 Avg nsec/op 80.4 gettimeofday usec 126536800 TPUT ops/sec 12430091.5
[jeff@pedro closson_slb]$ sh ./example2
Total ops 1572864000 Avg nsec/op 74.2 gettimeofday usec 116724111 TPUT ops/sec 13475056.6
Total ops 1572864000 Avg nsec/op 74.4 gettimeofday usec 117040665 TPUT ops/sec 13438611.3
I’m not sure what happened to the second run of my earlier 32-bit environment where it was 111.1 nsec/op… perhaps I unwittingly polluted it with other activity.
As before, anything else you want running on a socket F, just shout.
Cheers
Jeff
Very cool for Socket F in a PC (presumably a northbridge with 667MHz memory)…
How about trying the following in example2’s case:
./cpu_bind $$ 3
./create_sem
numactl --membind=1 ./memhammer 262144 6000 &
./cpu_bind $$ 1
numactl --membind=0 ./memhammer 262144 6000 &
./trigger
wait
Like this:
[jeff@pedro closson_slb]$ cat example3
./cpu_bind $$ 3
./create_sem
numactl --membind=1 ./memhammer 262144 6000 &
./cpu_bind $$ 1
numactl --membind=0 ./memhammer 262144 6000 &
./trigger
wait
[jeff@pedro closson_slb]$ sh ./example3
Total ops 1572864000 Avg nsec/op 74.3 gettimeofday usec 116886057 TPUT ops/sec 13456386.8
Total ops 1572864000 Avg nsec/op 74.6 gettimeofday usec 117377932 TPUT ops/sec 13399997.5
Similar results to example 2.
Yes, it’s a Northbridge 667MHz.
This board: http://www.tyan.com/product_board_detail.aspx?pid=523
Cheers
Jeff
For giggles, I tried this on a dual six-core AMD Opteron 2427 at 2211MHz. I reformatted the results lightly and added a column tracking how many multiples the nsec/op was of the no-contention, single-thread run.
Total ops per thread: 786,432,000
12 threads:
Avg nsec/op | Times slower | gettimeofday usec | TPUT ops/sec
91.1 | x1.4 | 71651058 | 10,975,860.3
99.8 | x1.5 | 78490659 | 10,019,434.3
126.3 | x1.9 | 99320876 | 7,918,093.7
148.3 | x2.2 | 116648048 | 6,741,921.6
148.4 | x2.2 | 116737052 | 6,736,781.4
152.2 | x2.3 | 119704872 | 6,569,757.7
166.7 | x2.5 | 131092598 | 5,999,057.2
166.7 | x2.5 | 131127808 | 5,997,446.4
166.9 | x2.5 | 131260804 | 5,991,369.7
167.3 | x2.5 | 131569765 | 5,977,300.3
167.4 | x2.5 | 131669208 | 5,972,786.0
167.6 | x2.5 | 131819814 | 5,965,962.0
8 threads:
Avg nsec/op | Times slower | gettimeofday usec | TPUT ops/sec
80.9 | x1.2 | 63588337 | 12,367,551.0
81.2 | x1.2 | 63858997 | 12,315,132.4
81.5 | x1.2 | 64125987 | 12,263,858.0
137.8 | x2.0 | 108372538 | 7,256,746.2
138.0 | x2.1 | 108542637 | 7,245,374.0
138.1 | x2.1 | 108620756 | 7,240,163.2
138.2 | x2.1 | 108713999 | 7,233,953.4
138.2 | x2.1 | 108716868 | 7,233,762.5
138.6 | x2.1 | 108963247 | 7,217,406.1
4 threads:
Avg nsec/op | Times slower | gettimeofday usec | TPUT ops/sec
73.3 | x1.1 | 57625309 | 13,647,336.8
73.3 | x1.1 | 57635127 | 13,645,012.0
73.4 | x1.1 | 57684952 | 13,633,226.2
73.5 | x1.1 | 57782413 | 13,610,231.2
2 threads:
Avg nsec/op | Times slower | gettimeofday usec | TPUT ops/sec
68.8 | x1.0 | 54117286 | 14,531,992.6
70.8 | x1.1 | 55662442 | 14,128,593.2
1 thread:
Avg nsec/op | Times slower | gettimeofday usec | TPUT ops/sec
67.3 | n/a | 52910227 | 14,863,515.9
Ahh yes, the Magny-Cours. Is that DDR3 ?