Archive for the 'AMD Memory Throughput' Category

SQL Server on Linux and Windows Offers the Same Performance

This one really surprised me, mostly because I don’t get much exposure to Oracle Database on the Windows platform. Back in June I blogged about Oracle’s 10g Linux 100,000+ TPC-C result. That was a cool result for the types of reasons I blogged about then. However, it looks like that configuration has been pretty busy since the June announcement, because today Oracle and HP announced an 11g TPC-C result on Windows of 102,454 TpmC on the same hardware, and it achieved THE SAME THROUGHPUT! Well, give or take 1.5%. Here are links to the results:

That’s what I call platform parity, and I think it’s really cool. I wonder how SQL Server would perform in a Linux/Windows side-by-side benchmark? I know, I know, that was dopey.

By the way, I encourage you to take a peek at the full disclosure reports for these benchmarks. The Oracle configuration parameters were pretty straightforward. No voodoo.

Oops, I almost forgot to mention that the title of this blog entry was meant to be humorous.

Oracle on Opteron with Linux-The NUMA Angle (Part VII).

This installment in my series about Oracle on Linux with NUMA hardware is very, very late. I started this series at the end of last year and it just kept getting put off—mostly because the hardware I needed to use was being used for other projects (my own projects). This is the seventh in the series and it’s time to show some Oracle numbers. Previously, I laid groundwork about such topics as SUMA/NUMA, NUMA API and so forth. To make those points I relied on microbenchmarks such as the Silly Little Benchmark. The previous installments can be found here.

To bring home the point that Oracle should be run on AMD boxes in NUMA mode (as opposed to SUMA), I decided to pick an Oracle workload that is very easy to understand as well as processor intensive. After all, the difference between SUMA and NUMA is average memory latency (SUMA is worse), so testing at any level below processor saturation actually yields the same throughput, albeit at a higher processor cost in the SUMA case. To that end, measuring SUMA and NUMA at processor saturation is the best way to see the difference.

The workload I’ll use for this testing is what my friend Anjo Kolk refers to as the Jonathan Lewis Oracle Computing Index workload. The workload comes in script form and is very straightforward. The important thing about the workload is that it hammers memory, which, of course, is the best way to see the NUMA effect. Jonathan Lewis needs no introduction, of course.

The test was set up to execute 4, 8, 16 and 32 concurrent invocations of the JL Comp script. The only difference in the test setup was that in one case I booted the server in SUMA mode and in the other I booted in NUMA mode and allocated hugepages. As I point out in this post about SUMA, hugepages are allocated in a NUMA fashion, and booting an SGA into this memory offers at least crude fairness in the placement of the SGA pages, certainly much better than a Cyclops. In short, what is being tested here is one case where memory is allocated at boot time in a completely round-robin fashion versus one where the SGA is quasi-round-robin yet page tables, kernel-side process-related structures and heap are all NUMA-optimized. Remember, this is no more difficult than a system boot option. Let’s get to the numbers.

[Figure: jlcomp.jpg, job completion times for the SUMA and NUMA runs]

I have also rolled up all the statspack reports into a Word document (as required by WordPress). The document is numa-statspack.doc and it consists of 8 statspack reports, each prefaced by the name of the specific test. If you pattern search for REPORT NAME you will see each entry. Since this is a simple memory latency improvement, you might not be surprised by how uninteresting the stats are, except of course for the vast improvement in the number of logical reads per second the NUMA tests were able to push through the system.

SUMA or NUMA
A picture is worth a thousand words. This simple test combined with this simple graph covers it all pretty well. The job completion time ranged from about 12 to 15 percent better with NUMA at each of the concurrent session counts. While 12 to 15% isn’t astounding, remember this workload is completely processor bound. How do you usually recoup 12-15% from a totally processor-bound workload without changing even a single line of code? Besides, this is only one workload, and the fact remains that the more your particular workload does outside the SGA (e.g., sorting), the more likely you are to see improvement. But whatever you do, do not run Oracle with Cyclops memory.

The Moral of the Story

Processors are going to get more cores and slower clock rates, and memory topologies will look a lot more NUMA than SUMA as time progresses. I think it is important to understand NUMA.

What is Oracle Doing About It?
Well, I’ve blogged about the fact that the Linux ports of 10g do not integrate with libnuma, which means they are not NUMA-aware. What I’ve tried to show in this series is that the world of NUMA is not binary. There is more to it than a choice between SUMA and full NUMA-awareness. In the middle is booting the server and database in a fashion that at least allows them to benefit from the OS-side NUMA-awareness. The next step is Oracle NUMA-awareness.
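For readers who have not poked at the NUMA API, the following is a minimal sketch of the sort of libnuma calls a NUMA-aware allocator could make. It is not Oracle code and it is not part of the SLB kit; it just illustrates asking for memory on a specific node and touching it so the placement really happens. The file name and sizes are my own stand-ins.

/* Illustrative only: the kind of libnuma calls a NUMA-aware allocator might make.
 * Build with something like: cc -O -o numa_sketch numa_sketch.c -lnuma
 * (requires the numactl/libnuma development package).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numa.h>

int main(void)
{
        size_t sz = 64UL * 1024 * 1024;   /* 64MB per node, just for the example */
        int node, max;

        if (numa_available() == -1) {
                fprintf(stderr, "no NUMA support on this kernel/library\n");
                return 1;
        }
        max = numa_max_node();

        for (node = 0; node <= max; node++) {
                /* Ask for memory physically resident on a specific node */
                char *p = numa_alloc_onnode(sz, node);
                if (p == NULL) {
                        perror("numa_alloc_onnode");
                        return 1;
                }
                memset(p, 0, sz);         /* touch it so the pages really land there */
                printf("allocated and touched %zu bytes on node %d\n", sz, node);
                numa_free(p, sz);
        }
        return 0;
}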

Just recently I was sitting in a developer’s office in bldg 400 of Oracle HQ talking about NUMA. It was a good conversation. He stated that Oracle actually has NUMA awareness in it and I said, “I know.” I don’t think Sequent was on his mind and I can’t blame him; that was a long time ago. The vestiges of NUMA awareness in Oracle 10g trace back to the high-end proprietary NUMA implementations of the 1990s. So if “it’s in there,” what’s missing? We both said vgetcpu() at the same time. You see, you can’t have Oracle making runtime decisions about local versus remote memory if a process doesn’t know what CPU it is currently executing on, and that detection has to cost less than a handful of instructions. Things like vgetcpu() seem to be coming along. That means once these APIs are fully baked, I think we’ll see Oracle resurrect intrinsic NUMA awareness in the Linux port of Oracle Database akin to those wildcat ports of the late 90s…and that should be a good thing.
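As an aside, recent glibc versions provide a sched_getcpu(3) wrapper that sits on top of getcpu()/vgetcpu where the kernel supplies it. The tiny sketch below is not Oracle code; it only illustrates the “what CPU am I on right now?” question a NUMA-aware process has to be able to answer in a handful of instructions.

/* A sketch of the cheap "which CPU am I on?" lookup a NUMA-aware process needs.
 * Assumes a glibc recent enough to provide sched_getcpu(3).
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        int cpu = sched_getcpu();   /* thin wrapper over getcpu()/the vDSO on Linux */
        if (cpu == -1) {
                perror("sched_getcpu");
                return 1;
        }
        /* With the CPU number in hand, a process could map CPU to node and
         * prefer memory on its local node for runtime allocations. */
        printf("currently executing on CPU %d\n", cpu);
        return 0;
}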

Oracle on Opteron with Linux-The NUMA Angle (Part IV). Some More About the Silly Little Benchmark.

In my recent blog post entitled Oracle on Opteron with Linux-The NUMA Angle (Part III). Introducing the Silly Little Benchmark, I made available the SLB and hoped to get some folks to measure some other systems using the kit. Well, I got my first results back from a fellow member of the OakTable Network—Christian Antognini of Trivadis AG. I finally got to meet him face to face back in November 2006 at UKOUG.

Christian was nice enough to run it on a brand new Dell PE1900 with, yes, a quad-core “Clovertown” processor of the low-end E5320 variety. As packaged, this Clovertown-based system has a 1066MHz front side bus and the memory is configured with 4x1GB 667MHz DIMMs. The processor was clocked at 1.86GHz.

Here is a snippet of /proc/cpuinfo from Christian’s system:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5320 @ 1.86GHz
stepping : 7
cpu MHz : 1862.560

I asked Christian to run the SLB (memhammer) with 1, 2 and 4 threads of execution and to limit the amount of memory per process to 512MB. He submitted the following:


[cha@helicon slb]$ cat example4.sh
./cpu_bind $$ 3
./create_sem
./memhammer 131072 6000 &
./trigger
wait

./cpu_bind $$ 3
./create_sem
./memhammer 131072 6000 &
./cpu_bind $$ 1
./memhammer 131072 6000 &
./trigger
wait

./cpu_bind $$ 3
./create_sem
./memhammer 131072 6000 &
./cpu_bind $$ 2
./memhammer 131072 6000 &
./cpu_bind $$ 1
./memhammer 131072 6000 &
./cpu_bind $$ 0
./memhammer 131072 6000 &
./trigger
wait
[cha@helicon slb]$ ./example4.sh
Total ops 786432000 Avg nsec/op 131.3 gettimeofday usec 103240338 TPUT ops/sec 7617487.7
Total ops 786432000 Avg nsec/op 250.4 gettimeofday usec 196953024 TPUT ops/sec 3992992.8
Total ops 786432000 Avg nsec/op 250.7 gettimeofday usec 197121780 TPUT ops/sec 3989574.4
Total ops 786432000 Avg nsec/op 503.6 gettimeofday usec 396010106 TPUT ops/sec 1985888.7
Total ops 786432000 Avg nsec/op 503.6 gettimeofday usec 396023560 TPUT ops/sec 1985821.2
Total ops 786432000 Avg nsec/op 504.6 gettimeofday usec 396854086 TPUT ops/sec 1981665.4
Total ops 786432000 Avg nsec/op 505.1 gettimeofday usec 397221522 TPUT ops/sec 1979832.3

So, while this is not a flagship Clovertown Xeon (e.g., E5355), the latencies are pretty scary. Contrast these results with the DL585 Opteron 850 numbers I shared in this blog entry. The Opteron 850 delivers 69ns with 2 concurrent threads of execution, some 47% quicker than this system exhibits with only 1 memhammer process running, and the direct comparison of 2 concurrent memhammer processes is an astounding 3.6x slower than the Opteron 850 box. Here we see the true benefit of an on-die memory controller and the fact that HyperTransport is a true 1GHz path to memory. With 4 concurrent memhammer processes, the E5320 bogged down to roughly 500ns! I’ll blog soon about what I see with the SLB on 4 sockets with my DL585 in the continuing NUMA series.

Other Numbers
I’d sure like to get numbers from others. How about Linux on Itanium? How about a Power system with AIX? How about some SPARC numbers? Anyone have a Socket F Opteron box they could collect SLB numbers on? If so, get the kit and run the same scripts Christian did on his Dell PE1900. Thanks, Christian.

A Note About The SLB
Be sure to limit the memory allocation such that it does not cause major page faults or, eek, swapping. The first argument to memhammer is the number of 4KB pages to allocate (e.g., the 131072 pages used above works out to 512MB per process).

Oracle on Opteron with Linux-The NUMA Angle (Part III). Introducing The Silly Little Benchmark.

In my blog “mini-series” about Oracle on Opteron NUMA, I am about to start covering the Linux 2.6 NUMA API and what it means to Oracle. I will share a lot of statspack information for certain, but first we need to start with micro-benchmark tests. The best micro-benchmark for analyzing memory latency is one that uses the least amount of processor cycles to write memory that is most likely not in processor cache. That is, spend the fewest cycles to put the most load on the memory subsystem. To that end, I’d like to make available an SLB, a Silly Little Benchmark.

Introducing the Silly Little Benchmark
I took some old code of mine as the framework and stripped out all but the most simplistic code to make the points of memory latency and locality clear. Now, I’m not suggesting this SLB mimics Oracle at all. There are three significant points of contention missing from this SLB that would bring it closer to what Oracle does to a system:

  1. Shared Memory
  2. Mutual Exclusion (e.g., spinlocks in shared memory)
  3. I/O

It just so happens that the larger code I ripped this SLB from does possess these three characteristics; however, I want to get the simplest form of it out there first. As this NUMA series progresses I’ll make other pieces available. Also, this version of the SLB is quite portable; it should work on just about any modern Unix variant.

Where to Download the SLB
The SLB kit is stored in a tar archive here (slb.tar).

Description of the SLB
It is supposed to be very simple and it is. It consists of four parts:

  • create_sem: as simple as it sounds. It creates a single IPC semaphore in advance of memhammer.
  • memhammer: the Silly Little Benchmark driver. It takes two arguments (no option flags). The first argument is the number of 4KB pages to allocate and the second is the number of loop iterations to perform.
  • trigger: all memhammer processes wait on the semaphore created by create_sem; trigger operates the semaphore to start a run (see the sketch after this list).
  • cpu_bind: binds a process to a CPU.
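For the curious, here is a rough sketch (not the actual SLB source) of how the create_sem/wait/trigger choreography can be wired up with a System V semaphore: the semaphore starts at zero, each memhammer blocks on a decrement, and trigger adds one count per waiting process so they all start together. The key value and worker count below are purely illustrative.

/* Not the actual SLB source: a sketch of the create_sem / wait / trigger pattern
 * using a System V semaphore created with a well-known key.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define SLB_KEY  0x534c42   /* "SLB": illustrative key only */
#define NWORKERS 4

/* Required for SETVAL on Linux/glibc; the headers do not define it. */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

/* create_sem side: create one semaphore with an initial value of 0 */
int slb_create(void)
{
        union semun arg;
        int semid = semget(SLB_KEY, 1, IPC_CREAT | 0666);
        if (semid == -1) { perror("semget"); exit(1); }
        arg.val = 0;
        if (semctl(semid, 0, SETVAL, arg) == -1) { perror("semctl"); exit(1); }
        return semid;
}

/* memhammer side: block until the trigger releases us */
void slb_wait(int semid)
{
        struct sembuf op = { 0, -1, 0 };   /* decrement; blocks while the value is 0 */
        if (semop(semid, &op, 1) == -1) { perror("semop wait"); exit(1); }
}

/* trigger side: add one count per waiting process so all start together */
void slb_trigger(int semid, int nwaiters)
{
        struct sembuf op = { 0, (short)nwaiters, 0 };
        if (semop(semid, &op, 1) == -1) { perror("semop trigger"); exit(1); }
}

int main(void)
{
        int i, semid = slb_create();

        for (i = 0; i < NWORKERS; i++) {
                if (fork() == 0) {             /* worker: wait, then do its "work" */
                        slb_wait(semid);
                        printf("worker %d released\n", i);
                        _exit(0);
                }
        }
        sleep(1);                              /* give the workers time to block */
        slb_trigger(semid, NWORKERS);          /* release them all at once */
        for (i = 0; i < NWORKERS; i++)
                wait(NULL);
        semctl(semid, 0, IPC_RMID);            /* clean up the semaphore */
        return 0;
}

In the real kit these live in separate executables (create_sem, memhammer, trigger); the sketch just puts the semaphore logic side by side in one program.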

The first action for running the SLB is to execute create_sem. Next, fire off any number of memhammer processes up to the number of processors on the system. It makes no sense running more memhammer processes than processors in the machine. Each memhammer will use malloc(3) to allocate some heap, initialize it all with memset(3) and then wait on the semaphore created by create_sem. Next, execute trigger and the SLB will commence its work loop, which runs through pseudo-random 4KB offsets in the malloc’ed memory and writes an 8-byte location within the first 64 bytes of each page. Why 64 bytes? All the 64-bit systems I know of cache physical memory in 64-byte lines. As long as we write on any location residing entirely within a 64-byte line, we have caused as much work for the memory subsystem as we would if we scribbled on each of the eight 8-byte words the line can hold. Not scribbling over the entire line relieves us of the CPU overhead and allows us to put more duress on the memory subsystem, and that is the goal. The SLB has a very small measured unit of work, but it causes maximum memory stalls. Well, not maximum; that would require spinlock contention, but this is good enough for this point of the NUMA mini-series.
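To make that description concrete, here is a sketch of that kind of work loop. It is not the memhammer source; the LCG used for pseudo-random page selection and the default argument values are my own stand-ins, but the structure is the same: pick a pseudo-random 4KB page, write one 8-byte word inside its first 64 bytes, and time the whole run with gettimeofday(2).

/* Not the actual memhammer source: a sketch of an SLB-style work loop.
 * Assumes a 64-bit build (unsigned long is 64 bits). A simple LCG supplies the
 * pseudo-random page offsets so the loop itself stays cheap.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(int argc, char *argv[])
{
        long pages = (argc > 1) ? atol(argv[1]) : 131072;   /* number of 4KB pages */
        long iters = (argc > 2) ? atol(argv[2]) : 6000;     /* loop iterations */
        long ops   = pages * iters;
        size_t sz  = (size_t)pages * 4096;
        char *buf  = malloc(sz);
        struct timeval t0, t1;
        unsigned long seed = 88172645463325252UL;           /* arbitrary seed */
        double usec;
        long i;

        if (buf == NULL) { perror("malloc"); return 1; }
        memset(buf, 0, sz);                  /* touch every page up front */

        gettimeofday(&t0, NULL);
        for (i = 0; i < ops; i++) {
                /* Knuth-style LCG step for a cheap pseudo-random page pick */
                seed = seed * 6364136223846793005UL + 1442695040888963407UL;
                long page = (long)(seed % (unsigned long)pages);
                /* Write one 8-byte word within the first 64 bytes of the page:
                 * a full cache line of memory-subsystem work for minimal CPU work. */
                *(volatile unsigned long *)(buf + page * 4096) = seed;
        }
        gettimeofday(&t1, NULL);

        usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("Total ops %ld Avg nsec/op %.1f TPUT ops/sec %.1f\n",
               ops, usec * 1000.0 / ops, ops / (usec / 1e6));
        return 0;
}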

Measuring Time
In prior lives, all of this sort of low-level measuring that I’ve done was performed with x86 assembly that reads the processor cycle counter (RDTSC). However, I’ve found it to be very inconsistent on multi-core AMD processors no matter how much fiddling I do to serialize the processor (e.g., with the CPUID instruction). It could just be me, I don’t know. It turns out that it is difficult to stop out-of-order reading of the TSC, and I don’t have time to fiddle with it on a pre-Socket F Opteron. When I finally get my hands on a Socket F Opteron system, I’ll change my measurement technique to RDTSCP, which serializes and reads the time stamp counter in a single instruction. Until then, I think performing millions upon millions of operations and dividing microsecond-resolution gettimeofday(2) elapsed time by the operation count should be about sufficient. Any trip through the work loop that gets nailed by hardware interrupts will unfortunately increase the average, but running the SLB on an otherwise idle system should be a pretty clean test.
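For reference, here is a sketch of an RDTSCP-based reading. It assumes an x86-64 compiler with GNU inline assembly and a processor that actually implements RDTSCP (pre-Socket F Opterons do not), and converting cycles to nanoseconds still requires knowing the fixed TSC frequency.

/* A sketch of reading the time stamp counter with RDTSCP on x86-64.
 * RDTSCP waits for prior instructions to complete before sampling the counter,
 * which is the serialization property discussed above.
 */
#include <stdio.h>

static inline unsigned long long rdtscp(unsigned int *aux)
{
        unsigned int lo, hi;
        __asm__ __volatile__("rdtscp" : "=a"(lo), "=d"(hi), "=c"(*aux));
        return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
        unsigned int cpu_a, cpu_b;
        unsigned long long t0 = rdtscp(&cpu_a);
        /* ... timed work would go here ... */
        unsigned long long t1 = rdtscp(&cpu_b);

        /* On Linux the kernel typically programs TSC_AUX so the value identifies
         * the CPU, letting a migration between the two reads be detected and the
         * sample discarded. */
        printf("delta cycles %llu (cpu tag %u -> %u)\n", t1 - t0, cpu_a, cpu_b);
        return 0;
}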

Example Measurements
Getting ready for the SLB is quite simple. Simply extract and compile:

$ ls -l slb.tar
-rw-r--r-- 1 root root 20480 Jan 26 10:12 slb.tar
$ tar xvf slb.tar
cpu_bind.c
create_sem.c
Makefile
memhammer.c
trigger.c
$ make

cc -c -o memhammer.o memhammer.c
cc -O -o memhammer memhammer.o
cc -c -o trigger.o trigger.c
cc -O -o trigger trigger.o
cc -c -o create_sem.o create_sem.c
cc -O -o create_sem create_sem.o
cc -c -o cpu_bind.o cpu_bind.c
cc -O -o cpu_bind cpu_bind.o

Some Quick Measurements
I used my DL585 with 4 dual-core Opteron 850s to test a single invocation of memhammer and then compared it to 2 invocations on the same socket. The first test bound the execution to processor number 7, which executed the test in 108.28 seconds with an average write latency of 68.8ns. The next test was executed with 2 invocations, both on the same physical CPU. This caused the result to be a bit “lumpy.” The average of the two was 70.2ns, about 2% more than the single invocation on the same processor. For what it is worth, there was a 2.4% average latency variation between the two concurrent invocations:

$ cat example1
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./trigger
wait

./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 6
./memhammer 262144 6000 &
./trigger
wait

$ sh ./example1

Total ops 1572864000 Avg nsec/op 68.8 gettimeofday usec 108281130 TPUT ops/sec 14525744.2

Total ops 1572864000 Avg nsec/op 69.3 gettimeofday usec 108994268 TPUT ops/sec 14430703.8

Total ops 1572864000 Avg nsec/op 71.0 gettimeofday usec 111633529 TPUT ops/sec 14089530.4

The next test was to dedicate an entire socket to each of two concurrent invocations, which really smoothed out the result. Executing this way resulted in about 1/10th of 1 percent variance between the average write latencies:

$ cat example2

./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./trigger
wait

$ sh ./example2

Total ops 1572864000 Avg nsec/op 69.1 gettimeofday usec 108640712 TPUT ops/sec 14477666.5

Total ops 1572864000 Avg nsec/op 69.2 gettimeofday usec 108871507 TPUT ops/sec 14446975.6

Cool And Quiet
Up to this point, the testing was done with the DL585s executing at 2200MHz. Since I have my DL585s set for dynamic power consumption adjustment, I can simply blast a SIGUSR2 at the cpuspeed processes. The cpuspeed processes catch the SIGUSR2 and adjust the P-state of the processor to the lowest power consumption, a feature supported by AMD Cool And Quiet Technology. The following shows how to determine the clock speed bounds the processor can execute at. In my case, the processor will range from 2200MHz down to 1800MHz. Note, I recommend fixing the clock speed with SIGUSR1 or SIGUSR2 before any performance testing; otherwise you might grow tired of inconsistent results. Note also that there is no man page for the cpuspeed executable; you have to execute it to get a help message with command usage.

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
2200000 2000000 1800000

# chkconfig --list cpuspeed
cpuspeed 0:off 1:on 2:on 3:on 4:on 5:on 6:on

# ps -ef | grep cpuspeed | grep -v grep
root 1796 1 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1797 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1798 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1799 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1800 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1801 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1802 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1803 1796 0 11:23 ? 00:00:00 cpuspeed -d -n

# cpuspeed -h 2>&1 | tail -12

To have a CPU stay at the highest clock speed to maximize performance send the process controlling that CPU the SIGUSR1 signal.

To have a CPU stay at the lowest clock speed to maximize battery life send the process controlling that CPU the SIGUSR2 signal.

To resume having a CPU’s clock speed dynamically scaled send the process controlling that CPU the SIGHUP signal.

So, now that the clock speed has been adjusted down 18% from 2200MHz to 1800MHz, let’s see what example1 does:

$ sh ./example1
Total ops 1572864000 Avg nsec/op 81.6 gettimeofday usec 128352437 TPUT ops/sec 12254259.0
Total ops 1572864000 Avg nsec/op 77.0 gettimeofday usec 121043918 TPUT ops/sec 12994159.7
Total ops 1572864000 Avg nsec/op 82.8 gettimeofday usec 130198851 TPUT ops/sec 12080475.3

The slower clock rate brought the single-invocation number up to an 81.6ns average, an increase of roughly 18%.

With the next blog entry, I’ll start to use the SLB to point out NUMA characteristics of 4-way Opteron servers. After that it will be time for some real Oracle numbers. Please stay tuned.

Non-Linux Platforms
My old buddy Glenn Fawcett of Sun’s Strategic Applications Engineering Group has collected data from both SPARC and Opteron-based Sun servers. He said to use the following compiler options:

  • -xarch=v9 … for SPARC
  • -xarch=amd64 for Opteron

He and I had martinis the other night, but I forgot to ask him to run it on idle systems for real measurement purposes. Having said that, I’d really love to see numbers from anyone who cares to run this. It would be great to have the output of /proc/cpuinfo and the server make/model. In fact, if someone can run this on a Socket F Opteron system I’d greatly appreciate it. It seems our Socket F systems are not slated to arrive here for a few more weeks.

