Archive for the 'Opteron Oracle' Category

You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part III.

By The Way, How Many NUMA Nodes Is Your AMD Opteron 6100-Based Server?

In my ongoing series about Oracle Database 11g configuration for NUMA systems I’ve spoken of the enabling parameter and how it changed from _enable_NUMA_optimization (11.1) to _enable_NUMA_support (11.2). For convenience’s sake I’ll point to the other two posts in the series for folks that care to catch up.
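
For reference, it is set like any other underscore (hidden) parameter, either in the pfile or via alter system with the name quoted. A minimal sketch only; as always, don't set underscore parameters on a production system without a blessing from Oracle Support:

# pfile entry (11.2; in 11.1 the parameter was _enable_NUMA_optimization)
_enable_NUMA_support=TRUE

SQL> -- or, when running from an spfile:
SQL> alter system set "_enable_NUMA_support"=TRUE scope=spfile;
SQL> shutdown immediate
SQL> startup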

What does AMD Opteron 6100 (Magny-Cours) have to do with my on-going series on enabling/disabling NUMA features in Oracle Database? That’s a good question. However, wouldn’t it be premature to just presume each of these 12-core processors is a NUMA node?

The AMD Opteron 6100 is a Multi-Chip Module (MCM). The “package” is two hex-core processors essentially “glued” together and placed into a socket. Each die has its own memory controller (hint, hint). I wonder what the Operating System sees in the case of a 4-socket server? Let’s take a peek.

The following is output from the numactl(8) command on a 4s48c Opteron 6100 (G34)-based server:

# numactl --hardware
available: 8 nodes (0-7)
node 0 size: 8060 MB
node 0 free: 7152 MB
node 1 size: 16160 MB
node 1 free: 16007 MB
node 2 size: 8080 MB
node 2 free: 8052 MB
node 3 size: 16160 MB
node 3 free: 15512 MB
node 4 size: 8080 MB
node 4 free: 8063 MB
node 5 size: 16160 MB
node 5 free: 15974 MB
node 6 size: 8080 MB
node 6 free: 8051 MB
node 7 size: 16160 MB
node 7 free: 15519 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  16  22  22  16
  2:  16  22  10  16  16  16  16  16
  3:  22  16  16  10  16  16  22  22
  4:  16  16  16  16  10  16  16  22
  5:  22  22  16  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  16  22  22  16  16  10
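
As a cross-check, the same topology can be read straight out of sysfs; a quick sketch (standard Linux sysfs paths, nothing Opteron 6100-specific):

$ ls -d /sys/devices/system/node/node*        # one directory per NUMA node
$ cat /sys/devices/system/node/node0/cpulist  # which of the 48 cores belong to node 0
$ numastat                                    # per-node numa_hit/numa_miss counters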

Heft
It wasn’t that long ago that an 8-node NUMA system was so large that a fork lift was necessary to move it about (think Sequent, SGI, DG, DEC etc). Even much more recent 8-socket (thus 8 NUMA nodes) servers were a 2-man lift and quite large (e.g., 7U HP Proliant DL785). These days, however, an 8-node NUMA system like the AMD Opteron 6100 (G34) comes in a 2U package!

Is it time yet to stop thinking that NUMA is niche technology?

I’ll blog soon about booting Oracle to test NUMA optimizations on these 8-node servers.

Oracle11g Automatic Memory Management – Part III. A NUMA Issue.

Now I’m glad I did that series about Oracle on Linux, The NUMA Angle. In my post about the difference between NUMA and SUMA and “Cyclops”, I shared a lot of information about the dynamics of Oracle running with all the SGA allocated from one memory bank on a NUMA system. Déjà vu.

Well, we’re at it again. As I point out in Part I and Part II of this series, Oracle implements Automatic Memory Management in Oracle Database 11g with memory mapped files in /dev/shm. That got me curious.

Since I exclusively install my Oracle bits on NFS mounts, I thought I’d sling my 11g ORACLE_HOME over to a DL385 I have available in my lab setup. Oh boy am I going to miss that lab when I take on my new job September 4th. Sob, sob. See, when you install Oracle on NFS mounts, the installation is portable. I install 32bit Linux ports via a 32bit server into an NFS mount and I can take it anywhere. In fact, since the database is on an NFS mount (HP EFS Clustered Gateway NAS) I can take ORACLE_HOME and the database mounts to any system with a RHEL4 OS running-and that includes RHEL4 x86_64 servers even though the ORACLE_HOME is 32bit. That works fine, except 32bit Oracle cannot use libaio on 64bit RHEL4 (unless you invoke everything under the linux32 command environment, that is). I don’t care about that since I use either Oracle Disk Manager or, better yet, Oracle11g Direct NFS. Note, running 32bit Oracle on a 64bit Linux OS is not supported for production, but for my case it helps me check certain things out.

That brings us back to /dev/shm on AMD Opteron (NUMA) systems. It turns out the only Opteron system I could test 11g AMM on happens to have x86_64 RHEL4 installed-but, again, no matter.

Quick Test

[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 3955 MB
[root@tmr6s5 ~]# dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024
1024+0 records in
1024+0 records out
[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 2927 MB

Uh, that’s not good. I dumped some zeros into a file on /dev/shm and all the memory was allocated from socket 1. Lest anyone forget from my NUMA series (you did read that, didn’t you?), writing to memory not connected to your processor is, uh, slower:

[root@tmr6s5 ~]# taskset -pc 0-1 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 0,1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m1.116s
user    0m0.005s
sys     0m1.111s
[root@tmr6s5 ~]# taskset -pc 1-2 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m0.931s
user    0m0.006s
sys     0m0.923s

Yes, 20% slower.
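
The same experiment can be expressed with numactl(8) instead of taskset(1). A sketch; depending on the numactl version the option is spelled --cpubind or --cpunodebind:

# numactl --cpubind=0 dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
# numactl --cpubind=1 dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc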

What About Oracle?
So, like I said, I mounted that ORACLE_HOME on this Opteron server. What does an AMM instance look like? Here goes:

SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3587 MB
node 1 size: 4095 MB
node 1 free: 3956 MB
SQL> startup pfile=./amm.ora
ORACLE instance started.

Total System Global Area 2276634624 bytes
Fixed Size                  1300068 bytes
Variable Size             570427804 bytes
Database Buffers         1694498816 bytes
Redo Buffers               10407936 bytes
Database mounted.
Database opened.
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 1331 MB
node 1 size: 4095 MB
node 1 free: 3951 MB

Ick. This means that Oracle11g AMM on Opteron servers is a Cyclops. Odd how this allocation came from memory attached to socket 0 when the file creation with dd(1) landed in socket 1’s memory. Hmm…
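
On kernels new enough to have /proc/&lt;pid&gt;/numa_maps (2.6.14 and later, so not the RHEL4 kernel used here), the placement can be confirmed directly against the instance; a sketch, with &lt;pmon pid&gt; standing in as a placeholder:

$ ps -ef | grep ora_pmon                     # any process attached to the SGA will do
$ grep shm /proc/&lt;pmon pid&gt;/numa_maps        # the N0=/N1= fields count pages per node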

What to do? SUMA? Well, it seems as though I should be able to interleave tmpfs memory and use that for /dev/shm-at least according to the tmpfs documentation. And should is the operative word. I have been tweaking for a half hour to get the mpol=interleave mount option (with and without the -o remount technique) to no avail. Bummer!
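
For the record, the syntax I have been fighting with looks like this per the tmpfs documentation. This is just a sketch, and whether a given kernel actually honors mpol= is exactly what is in question here; the size= value is only an example:

# mount -o remount,mpol=interleave /dev/shm
# umount /dev/shm; mount -t tmpfs -o size=4g,mpol=interleave tmpfs /dev/shm
# grep shm /proc/mounts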

Impact
If AMD can’t get the Barcelona and/or Budapest Quad-core off the ground (and into high-quality servers from HP/IBM/DELL/Verari), none of this will matter. Actually, come to think of it, unless Barcelona is really, really fast, you won’t be sticking it into your existing Socket F motherboards because that doubles your Oracle license fee (unless you are on standard edition which is priced on socket count). That leaves AMD Quad-core adopters waiting for HyperTransport 3.0 as a remedy. I blogged all this AMD Barcelona stuff already.

Given the NUMA characteristics of /dev/shm, I think I’ll test AMM versus MMM on NUMA, and then test again on SUMA-if I can find the time.

If anyone can get /dev/shm mounted with the mpol option, please let me know because, at times, I can be quite a dolt and I’d love this to be one of them.

Oracle on Opteron with Linux-The NUMA Angle (Part VII).

This installment in my series about Oracle on Linux with NUMA hardware is very, very late. I started this series at the end of last year and it just kept getting put off—mostly because the hardware I needed to use was being used for other projects (my own projects). This is the seventh in the series and it’s time to show some Oracle numbers. Previously, I laid groundwork about such topics as SUMA/NUMA, NUMA API and so forth. To make those points I relied on microbenchmarks such as the Silly Little Benchmark. The previous installments can be found here.

To bring home the point that Oracle should be run on AMD boxes in NUMA mode (as opposed to SUMA), I decided to pick an Oracle workload that is very easy to understand as well as processor intensive. After all, the difference between SUMA and NUMA is higher memory latency so testing at any level below processor saturation actually provides the same throughput-albeit the SUMA result would come at a higher processor cost. To that end, measuring SUMA and NUMA at processor saturation is the best way to see the difference.

The workload I’ll use for this testing is what my friend Anjo Kolk refers to as the Jonathan Lewis Oracle Computing Index workload. The workload comes in script form and is very straightforward. The important thing about the workload is that it hammers memory which, of course, is the best way to see the NUMA effect. Jonathan Lewis needs no introduction of course.

The test was set up to execute 4, 8, 16 and 32 concurrent invocations of the JL Comp script. The only difference in the test setup was that in one case I booted the server in SUMA mode and in the other I booted in NUMA mode and allocated hugepages. As I point out in this post about SUMA, hugepages are allocated in a NUMA fashion and booting an SGA into this memory offers at least crude fairness in placement of the SGA pages—certainly much better than a Cyclops. In short, what is being tested here is one case where memory is allocated at boot time in a completely round-robin fashion versus one where the SGA is quasi-round robin yet page tables, kernel-side process-related structures and heap are all NUMA-optimized. Remember, this is no more difficult than a system boot option. Let’s get to the numbers.
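
For anyone who wants to reproduce the setup first, the difference between the two boots amounts to BIOS node interleaving (SUMA) versus leaving interleaving off and reserving hugepages for the SGA (NUMA); roughly along these lines, with an illustrative hugepage count:

# NUMA case: node interleaving disabled in the BIOS, hugepages reserved for the SGA
echo "vm.nr_hugepages = 2048" >> /etc/sysctl.conf   # 2048 x 2MB pages = 4GB (example only, size to your SGA)
sysctl -p
# SUMA case: node interleaving enabled in the BIOS (the SGA was not in hugepages for that run)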

[Figure: jlcomp.jpg - job-complete times for the JL Comp workload at 4, 8, 16 and 32 sessions, SUMA versus NUMA]

I have also rolled up all the statspack reports into a Word document (as required by WordPress). The document is numa-statspack.doc and it consists of 8 statspack reports, each prefaced by the name of the specific test. If you pattern search for REPORT NAME you will see each entry. Since this is a simple memory latency improvement, you might not be surprised how uninteresting the stats are-except, of course, the vast improvement in the number of logical reads per second the NUMA tests were able to push through the system.

SUMA or NUMA
A picture speaks a thousand words. This simple test combined with this simple graph covers it all pretty well. The job-complete time ranged from about 12 to 15 percent better with NUMA at each of the concurrent session counts. While 12 to 15% isn’t astounding, remember this workload is completely processor bound. How do you usually recoup 12-15% from a totally processor-bound workload without changing even a single line of code? Besides, this is only one workload, and the fact remains that the more your particular workload does outside the SGA (e.g., sorting) the more likely you are to see improvement. But by all means, do not run Oracle with Cyclops memory.

The Moral of the Story

Processors are going to get more cores and slower clock rates and memory topologies will look a lot more NUMA than SUMA as time progresses. I think it is important to understand NUMA.

What is Oracle Doing About It?
Well, I’ve blogged about the fact that the Linux ports of 10g do not integrate with libnuma. That means they are not NUMA-aware. What I’ve tried to show in this series is that the world of NUMA is not binary. There is more to it than SUMA or NUMA-aware. In the middle is booting the server and database in a fashion that at least allows benefit from the OS-side NUMA-awareness. The next step is Oracle NUMA-awareness.

Just recently I was sitting in a developer’s office in bldg 400 of Oracle HQ talking about NUMA. It was a good conversation. He stated that Oracle actually has NUMA awareness in it and I said, “I know.” I don’t think Sequent was on his mind and I can’t blame him—that was a long time ago. The vestiges of NUMA awareness in Oracle 10g trace back to the high-end proprietary NUMA implementations of the 1990s.  So if “it’s in there” what’s missing? We both said vgetcpu() at the same time. You see, you can’t have Oracle making runtime decisions about local versus remote memory if a process doesn’t know what CPU it is currently executing on (detection with less than a handful of instructions).  Things like vgetcpu() seem to be coming along. That means once these APIs are fully baked, I think we’ll see Oracle resurrect intrinsic NUMA awareness in the Linux port of Oracle Database akin to those wildcat ports of the late 90s…and that should be a good thing.
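
For the curious, the user-side call I am talking about is tiny. Here is a sketch using the glibc sched_getcpu(3) wrapper where it exists (on older kernels/glibc you would go after the vgetcpu vsyscall directly); this is purely illustrative, not Oracle code:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Which CPU is this process executing on right now? */
    int cpu = sched_getcpu();
    if (cpu < 0)
        perror("sched_getcpu");
    else
        printf("running on CPU %d\n", cpu);
    return 0;
}

With that answer in hand, and a map of which memory is local to which CPU, a process can decide at runtime whether a given buffer is local or remote. That is the missing building block I am referring to.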

Learn Danish Before You Learn About NUMA

I can’t speak Danish, but I have the next best thing—a Danish friend that speaks English. The Danish arm of Computer Reseller News has a video of Mogens Norgaard (founder of the OakTable Network of which I am glad to be a member). I have no idea whatsoever about what he is discussing, but since the video starts out with him pouring a beer I’m sure I’m missing out on something. No, hold it, I did get something. Featured prominently behind him is a well-used copy of my friend James Morle’s book Scaling Oracle8i.

By the way, if you want to be an Oracle expert, that book should be considered mandatory reading. I don’t care if it is based on Oracle8i, it is still rich with correct information. Also, if you are following my series on NUMA/Oracle, I particularly recommend section 8.1.2 which I contributed to this book. It covers the original NUMA port of Oracle—Sequent. Of particular interest should be the section on one of my only claims to fame: Quad-Local Buffer Preference.

I can’t recall, but perhaps that was the topic James and I were discussing in this photo Alex Gorbachev took at one of our pub stops during UKOUG 2006. Or, maybe we (James and I to the right in the photo) were discussing the guys to our right (Mogens and Thomas Presslie) who were wearing skirts—ur, uh, I mean kilts! I do recall that 5AM came early that morning. Not the best way to start my trip home.

Oracle on Opteron with Linux-The NUMA Angle (Part V). Introducing numactl(8) and SUMA. Is The Oracle x86_64 Linux Port NUMA Aware?

This blog entry is part five in a series. Please visit here for links to the previous installments.

Opteron-Based Servers are NUMA Systems
Or are they? It depends on how you boot them. For instance, I have 2 HP DL585 servers clustered with the PolyServe Database Utility for Oracle RAC. I booted one of the servers as a non-NUMA system by tweaking the BIOS so that memory was interleaved on a 4KB basis. This is a memory model HP calls Sufficiently Uniform Memory Access (SUMA), as stated in this DL585 Technology Brief (pg. 6):

Node interleaving (SUMA) breaks memory into 4-KB addressable entities. Addressing starts with address 0 on node 0 and sequentially assigns through address 4095 to node 0, addresses 4096 through 8191 to node 1, addresses 8192 through 12287 to node 2, and addresses 12288 through 16383 to node 3, and so on.

Booting in this fashion essentially turns an HP DL585 into a “flat-memory” SMP—or a SUMA in HP parlance. There seem to be conflicting monikers for Opteron SMPs used in this mode. IBM has a Redbook that covers the varying NUMA offerings in their System x portfolio. The abstract for this Redbook states:

The AMD Opteron implementation is called Sufficiently Uniform Memory Organization (SUMO) and is also a NUMA architecture. In the case of the Opteron, each processor has its own “local” memory with low latency. Every CPU can also access the memory of any other CPU in the system but at longer latency.

Whether it is SUMA or SUMO, the concept is cool, but a bit foreign to me given my NUMA background. The NUMA systems I worked on in the 90s consisted of distinct, separate small systems—each with their own memory and I/O cards, power supplies and so on. They were coupled into a single shared memory image with specialized hardware inserted into the system bus of each little system. These cards were linked together and the whole package was a cache coherent SMP (ccNUMA).

Is SUMA Recommended For Oracle?
Since the HP DL585 can be SUMA/SUMO, I thought I’d give it a test. But first I did a little research to see how most folks use these in the field. I know from the BIOS on my system that you actually get a warning and have to override it when setting up interleaved memory (SUMA). I also noticed that in one of HP’s Oracle Validated Configurations, the following statement is made:

Settings in the server BIOS adjusted to allow memory/node interleaving to work better with the ‘numa=off’ boot option

and:

Boot options
elevator=deadline numa=off

 

I found this to be strange, but I don’t yet fully understand why that recommendation is made. Why did they perform this validation with SUMA? When running a 4-socket Opteron system in SUMA mode, only 25% of all memory accesses will be to local memory. When I say all, I mean all—both user and kernel mode. The Linux 2.6 kernel is NUMA-aware, so it seems like a waste to transform a NUMA system into a SUMA system. How can boiling down a NUMA system with interleaving (SUMA) possibly be optimal for Oracle? I will blog about this more as this series continues.

Is the x86_64 Linux Oracle Port NUMA Aware?
No, sorry, it is not. I might as well just come out and say it.

The NUMA API for Linux is very rudimentary compared to the boutique features in legacy NUMA systems like Sequent DYNIX/ptx and SGI IRIX, but it does support memory and process placement. I’ll blog later about the things it is missing that a NUMA-aware Oracle port would require.

The Linux 2.6 kernel is NUMA aware, but what is there for applications? The NUMA API, which is implemented in a library called libnuma.so. But you don’t have to code to the API to effect NUMA awareness. The major 2.6 Linux kernel distributions (RHEL4 and SLES) ship with a command that uses the NUMA API in ways I’ll show later in this blog entry. The command is numactl(8) and it dynamically links to the NUMA API library (emphasis added by me):

$ uname -a
Linux tmr6s13 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ type numactl
numactl is hashed (/usr/bin/numactl)
$ ldd /usr/bin/numactl
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003ba3200000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003ba2f00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003ba2d00000)

Whereas the numactl(8) command links with libnuma.so, Oracle does not:

$ type oracle
oracle is /u01/app/oracle/product/10.2.0/db_1/bin/oracle
$ ldd /u01/app/oracle/product/10.2.0/db_1/bin/oracle
libskgxp10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libskgxp10.so (0x0000002a95557000)
libhasgen10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libhasgen10.so (0x0000002a9565a000)
libskgxn2.so => /u01/app/oracle/product/10.2.0/db_1/lib/libskgxn2.so (0x0000002a9584d000)
libocr10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libocr10.so (0x0000002a9594f000)
libocrb10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libocrb10.so (0x0000002a95ab4000)
libocrutl10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libocrutl10.so (0x0000002a95bf0000)
libjox10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libjox10.so (0x0000002a95d65000)
libclsra10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libclsra10.so (0x0000002a96830000)
libdbcfg10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libdbcfg10.so (0x0000002a96938000)
libnnz10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libnnz10.so (0x0000002a96a55000)
libaio.so.1 => /usr/lib64/libaio.so.1 (0x0000002a96f15000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003ba3200000)
libm.so.6 => /lib64/tls/libm.so.6 (0x0000003ba3400000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003ba3800000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003ba7300000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003ba2f00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003ba2d00000)
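
Just to make the contrast concrete, here is roughly what coding to the NUMA API looks like. This is a hypothetical sketch using libnuma calls (link with -lnuma), not anything from the Oracle server:

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA API not available\n");
        return 1;
    }
    /* Run on node 0's processors and place a 64MB allocation on node 0. */
    numa_run_on_node(0);
    char *buf = numa_alloc_onnode(64UL << 20, 0);
    if (buf == NULL)
        return 1;
    buf[0] = 1;                  /* touch it so a page is actually faulted in */
    numa_free(buf, 64UL << 20);
    return 0;
}

That, in a nutshell, is the sort of placement decision a NUMA-aware Oracle port would be making on its own.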

No Big Deal, Right?
This NUMA stuff must just be a farce then, right? Let’s dig in. First, I’ll use the SLB (http://oaktable.net/getFile/148). Later I’ll move on to what fellow OakTable Network member Anjo Kolk and I refer to as the Jonathan Lewis Oracle Computing Index. The JL Oracle Computing Index is yet another microbenchmark that is very easy to run and compare memory throughput from one server to another using an Oracle workload. I’ll use this next to blog about NUMA effects/affects on a running instance of Oracle. After that I’ll move on to more robust Oracle OLTP and DSS workloads. But first, more SLB.

The SLB on SUMA/SUMO
First, let’s use the numactl(8) command to see what this DL585 looks like. Is it NUMA or SUMA?

$ uname -a
Linux tmr6s13 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ numactl --hardware
available: 1 nodes (0-0)
node 0 size: 32767 MB
node 0 free: 30640 MB

OK, this is a single node NUMA—or SUMA since it was booted with memory interleaving on. If it wasn’t for that boot option the command would report memory for all 4 “nodes” (nodes are sockets in the Opteron NUMA world). So, I set up a series of SLB tests as follows:

$ cat example1
echo "One thread"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./trigger
wait

echo "Two threads, same socket"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 6
./memhammer 262144 6000 &
./trigger
wait

echo "Two threads, different sockets"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./trigger
wait

echo "4 threads, 4 sockets"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./cpu_bind $$ 3
./memhammer 262144 6000 &
./cpu_bind $$ 1
./memhammer 262144 6000 &
./trigger
wait

echo "8 threads, 4 sockets"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./cpu_bind $$ 3
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./cpu_bind $$ 1
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./trigger
wait

And now the measurements:

$ sh ./example1
One thread
Total ops 1572864000 Avg nsec/op 71.5 gettimeofday usec 112433955 TPUT ops/sec 13989225.9
Two threads, same socket
Total ops 1572864000 Avg nsec/op 73.4 gettimeofday usec 115428009 TPUT ops/sec 13626363.4
Total ops 1572864000 Avg nsec/op 74.2 gettimeofday usec 116740373 TPUT ops/sec 13473179.5
Two threads, different sockets
Total ops 1572864000 Avg nsec/op 73.0 gettimeofday usec 114759102 TPUT ops/sec 13705788.7
Total ops 1572864000 Avg nsec/op 73.0 gettimeofday usec 114853095 TPUT ops/sec 13694572.2
4 threads, 4 sockets
Total ops 1572864000 Avg nsec/op 78.1 gettimeofday usec 122879394 TPUT ops/sec 12800063.1
Total ops 1572864000 Avg nsec/op 78.1 gettimeofday usec 122820373 TPUT ops/sec 12806214.2
Total ops 1572864000 Avg nsec/op 78.2 gettimeofday usec 123016921 TPUT ops/sec 12785753.3
Total ops 1572864000 Avg nsec/op 78.5 gettimeofday usec 123527864 TPUT ops/sec 12732868.1
8 threads, 4 sockets
Total ops 1572864000 Avg nsec/op 156.3 gettimeofday usec 245773200 TPUT ops/sec 6399656.3
Total ops 1572864000 Avg nsec/op 156.3 gettimeofday usec 245848989 TPUT ops/sec 6397683.4
Total ops 1572864000 Avg nsec/op 156.4 gettimeofday usec 245941009 TPUT ops/sec 6395289.7
Total ops 1572864000 Avg nsec/op 156.4 gettimeofday usec 246000176 TPUT ops/sec 6393751.5
Total ops 1572864000 Avg nsec/op 156.6 gettimeofday usec 246262366 TPUT ops/sec 6386944.2
Total ops 1572864000 Avg nsec/op 156.5 gettimeofday usec 246221624 TPUT ops/sec 6388001.1
Total ops 1572864000 Avg nsec/op 156.7 gettimeofday usec 246402465 TPUT ops/sec 6383312.8
Total ops 1572864000 Avg nsec/op 156.8 gettimeofday usec 246594031 TPUT ops/sec 6378353.9

SUMA baselines at 71.5ns average write operation and tops out at about 156ns with 8 concurrent threads of SLB execution (one per core). Let’s see what SLB on NUMA does.

SLB on NUMA
First, let’s get an idea what the memory layout is like:

$ uname -a
Linux tmr6s14 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ numactl --hardware
available: 4 nodes (0-3)
node 0 size: 8191 MB
node 0 free: 5526 MB
node 1 size: 8191 MB
node 1 free: 6973 MB
node 2 size: 8191 MB
node 2 free: 7841 MB
node 3 size: 8191 MB
node 3 free: 7707 MB

OK, this means that there is approximately 5.5GB, 6.9GB, 7.8GB and 7.7GB of free memory on “nodes” 0, 1, 2 and 3 respectively. Why is the first node (node 0) lop-sided? I’ll tell you in the next blog entry. Let’s run some SLB. First, I’ll use numactl(8) to invoke memhammer with the directive that forces allocation of memory on a node-local basis. The first test is one memhammer process per socket:

$ cat ./membind_example.4
./create_sem
numactl --membind 3 --cpubind 3 ./memhammer 262144 6000 &
numactl --membind 2 --cpubind 2 ./memhammer 262144 6000 &
numactl --membind 1 --cpubind 1 ./memhammer 262144 6000 &
numactl --membind 0 --cpubind 0 ./memhammer 262144 6000 &
./trigger
wait

$ bash ./membind_example.4
Total ops 1572864000 Avg nsec/op 67.5 gettimeofday usec 106113673 TPUT ops/sec 14822444.2
Total ops 1572864000 Avg nsec/op 67.6 gettimeofday usec 106332351 TPUT ops/sec 14791961.1
Total ops 1572864000 Avg nsec/op 68.4 gettimeofday usec 107661537 TPUT ops/sec 14609340.0
Total ops 1572864000 Avg nsec/op 69.7 gettimeofday usec 109591100 TPUT ops/sec 14352114.4

This test is the same as the one above called “4 threads, 4 sockets” performed on the SUMA configuration, where the latencies were 78ns. Switching from SUMA to NUMA and executing with NUMA placement brought the latencies down 13% to an average of 68ns. Interesting. Moreover, this test with 4 concurrent memhammer processes actually demonstrates better latencies than the single-process average on SUMA, which was 72ns. That comparison alone is quite interesting because it makes the point clear that SUMA in a 4-socket system is a 75% remote memory configuration—even for a single process like memhammer.

The next test was 2 memhammer processes per socket:

$ more membind_example.8
./create_sem
numactl --membind 3 --cpubind 3 ./memhammer 262144 6000 &
numactl --membind 3 --cpubind 3 ./memhammer 262144 6000 &
numactl --membind 2 --cpubind 2 ./memhammer 262144 6000 &
numactl --membind 2 --cpubind 2 ./memhammer 262144 6000 &
numactl --membind 1 --cpubind 1 ./memhammer 262144 6000 &
numactl --membind 1 --cpubind 1 ./memhammer 262144 6000 &
numactl --membind 0 --cpubind 0 ./memhammer 262144 6000 &
numactl --membind 0 --cpubind 0 ./memhammer 262144 6000 &
./trigger
wait

$ sh ./membind_example.8
Total ops 1572864000 Avg nsec/op 95.8 gettimeofday usec 150674658 TPUT ops/sec 10438809.2
Total ops 1572864000 Avg nsec/op 96.5 gettimeofday usec 151843720 TPUT ops/sec 10358439.6
Total ops 1572864000 Avg nsec/op 96.9 gettimeofday usec 152368004 TPUT ops/sec 10322797.2
Total ops 1572864000 Avg nsec/op 96.9 gettimeofday usec 152433799 TPUT ops/sec 10318341.5
Total ops 1572864000 Avg nsec/op 96.9 gettimeofday usec 152436721 TPUT ops/sec 10318143.7
Total ops 1572864000 Avg nsec/op 97.0 gettimeofday usec 152635902 TPUT ops/sec 10304679.2
Total ops 1572864000 Avg nsec/op 97.2 gettimeofday usec 152819686 TPUT ops/sec 10292286.6
Total ops 1572864000 Avg nsec/op 97.6 gettimeofday usec 153494359 TPUT ops/sec 10247047.6

What’s that? Writing memory on the SUMA configuration in the 8 concurrent memhammer case demonstrated latencies on the order of 156ns, but that dropped 38% to 97ns by switching to NUMA and using the Linux 2.6 NUMA API. No, of course an Oracle workload is not all random writes, but a system has to be able to handle the difficult aspects of a workload in order to offer good throughput. I won’t ask the rhetorical question of why Oracle is not NUMA aware in the x86_64 Linux ports until my next blog entry, where the measurements will not be based on the SLB, but on a real Oracle instance instead.

Déjà vu
Hold it. Didn’t the Dell PE1900 with a quad-core Clovertown Xeon E5320 exhibit ~500ns latencies with only 4 concurrent threads of SLB execution (1 per core)? That was what was shown in this blog entry. Interesting.

I hope it is becoming clear why NUMA awareness is interesting. NUMA systems offer a great deal of potential incremental bandwidth when local memory is preferred over remote memory.

Next up—comparisons of SUMA versus NUMA with the Jonathan Lewis Computing Index and why all is not lost just because the 10gR2 x86_64 Linux port is not NUMA aware.

Oracle on Opteron with Linux-The NUMA Angle (Part IV). Some More About the Silly Little Benchmark.

 

 

In my recent blog post entitled Oracle on Opteron with Linux-The NUMA Angle (Part III). Introducing the Silly Little Benchmark, I made available the SLB and hoped to get some folks to measure some other systems using the kit. Well, I got my first results back from a fellow member of the OakTable Network—Christian Antognini of Trivadis AG. I finally got to meet him face to face back in November 2006 at UKOUG.

Christian was nice enough to run it on a brand new Dell PE1900 with, yes, a quad-core “Clovertown” processor of the low-end E5320 variety. As packaged, this Clovertown-based system has a 1066MHz front side bus and the memory is configured with 4x1GB 667MHz dimms. The processor was clocked at 1.86GHz.

Here is a snippet of /proc/cpuinfo from Christian’s system:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5320 @ 1.86GHz
stepping : 7
cpu MHz : 1862.560

I asked Christian to run the SLB (memhammer) with 1, 2 and 4 threads of execution and to limit the amount of memory per process to 512MB. He submitted the following:

 

[cha@helicon slb]$ cat example4.sh
./cpu_bind $$ 3
./create_sem
./memhammer 131072 6000 &
./trigger
wait

./cpu_bind $$ 3
./create_sem
./memhammer 131072 6000 &
./cpu_bind $$ 1
./memhammer 131072 6000 &
./trigger
wait

./cpu_bind $$ 3
./create_sem
./memhammer 131072 6000 &
./cpu_bind $$ 2
./memhammer 131072 6000 &
./cpu_bind $$ 1
./memhammer 131072 6000 &
./cpu_bind $$ 0
./memhammer 131072 6000 &
./trigger
wait
[cha@helicon slb]$ ./example4.sh
Total ops 786432000 Avg nsec/op 131.3 gettimeofday usec 103240338 TPUT ops/sec 7617487.7
Total ops 786432000 Avg nsec/op 250.4 gettimeofday usec 196953024 TPUT ops/sec 3992992.8
Total ops 786432000 Avg nsec/op 250.7 gettimeofday usec 197121780 TPUT ops/sec 3989574.4
Total ops 786432000 Avg nsec/op 503.6 gettimeofday usec 396010106 TPUT ops/sec 1985888.7
Total ops 786432000 Avg nsec/op 503.6 gettimeofday usec 396023560 TPUT ops/sec 1985821.2
Total ops 786432000 Avg nsec/op 504.6 gettimeofday usec 396854086 TPUT ops/sec 1981665.4
Total ops 786432000 Avg nsec/op 505.1 gettimeofday usec 397221522 TPUT ops/sec 1979832.3

So, while this is not a flagship Clovertown Xeon (e.g., E5355), the latencies are pretty scary. Contrast these results with the DL585 Opteron 850 numbers I shared in this blog entry. The Opteron 850 is delivering 69ns with 2 concurrent threads of execution—some 47% quicker than this system exhibits with only 1 memhammer process running, and the direct comparison of 2 concurrent memhammer processes is an astounding 3.6x slower than the Opteron 850 box. Here we see the true benefit of an on-die memory controller and the fact that Hypertransport is a true 1GHz path to memory. With 4 concurrent memhammer processes, the E5320 bogged down to 500ns! I’ll blog soon about what I see with the SLB on 4 sockets with my DL585 in the continuing NUMA series.

Other Numbers
I’d sure like to get numbers from others. How about Linux Itanium? How about a Power system with AIX? How about some SPARC numbers? Anyone have a Socket F Opteron box they could collect SLB numbers on? If so, get the kit and run the same scripts Christian did on his Dell PE1900. Thanks, Christian.

A Note About The SLB
Be sure to limit the memory allocation such that it does not cause major page faults or, eek, swapping. The first argument to memhammer is the number of 4KB pages to allocate.
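
To spell out the arithmetic for the runs shown in this series:

131072 pages x 4KB = 512MB per memhammer process (Christian's PE1900 runs)
262144 pages x 4KB = 1GB per memhammer process (my DL585 runs)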

AMD Quad-Core “Barcelona” Processor For Oracle (Part IV) and the Web 2.0 Trolls.

This blog entry is the fourth in a series:

Oracle on Opteron with Linux–The NUMA Angle (Part I)

Oracle on Opteron with Linux-The NUMA Angle (Part II)

Oracle on Opteron with Linux-The NUMA Angle (Part III)

It Really is All About The Core, Not the Processor (Socket)
In my post entitled AMD Quad-core “Barcelona” Processor For Oracle (Part III). NUMA Too!, I had to set a reader straight over his lack of understanding where the terms processor, core and socket are concerned. He followed up with:

kevin – you are correct. your math is fine. though, i may still disagree about core being a better term than “physical processor”, but that is neither here, nor there.

He continues:

my gut told me based upon working with servers and knowing both architectures your calculations were incorrect, instead i errored in my math as you pointed out. *but*, i did uncover an error in your logic that makes your case worthless.

So, I am replying here and now. His gut may just be telling him that he ate something bad, or it could be his conscience getting to him for mouthing off over at the Investor Village AMD board where he called me a moron. His self-proclaimed server expertise is not relevant here, nor is it likely at the level he insinuates.

This is a blog about Oracle; I wish he’d get that through his head. Oracle licenses their flagship software (Real Application Clusters) at a list price of USD $60,000 per CPU. As I’ve pointed out, x86 cores are factored at .5 so a quad-core Barcelona will be 2 licenses—or $120,000 per socket. Today’s Tulsa processor licenses at $60,000 per socket and outperforms AMD’s projected Barcelona performance. AMD’s own promotional material suggests it will achieve a 70% OLTP (TPC-C) gain over today’s Opteron 2200. Sadly that is just not good enough where Oracle is concerned. I am a huge AMD fan, so this causes me grief.

Also, since he is such a server expert, he must certainly be aware that plugging a Barcelona processor into a Socket F board will need 70% headroom on the Hypertransport in order to attain that projected 70% OLTP increase. We aren’t talking about some CPU-only workload here, we are talking OLTP—as was AMD in that promotional video. OLTP hammers Hypertransport with tons of I/O, tons of contentious shared memory protected with spinlocks (a MESI snooping nightmare) and very large program text. I have seen no data anywhere suggesting this Socket F (Opteron 2200) TPC-C result of 139,693 TpmC was somehow achieved with 70% headroom to spare on the Hypertransport.

Specialized Hardware
Regarding the comparisons being made between the projected Barcelona numbers and today’s Xeon Tulsa, he states:

you are comparing a commodity chip with a specialized chip. those xeon processors in the ibm TPC have 16MB of L3 cache and cost about 6k a piece. amd most likely gave us the performance increase of the commodity version of barcelona, not a specialized version of barcelona. they specifically used it as a comparison, or upgrade of current socket TDP (65W,89W) parts.

What can I say about that? Specialized version of Barcelona? I’ve seen no indication of huge stepping plans, but that doesn’t matter. People run Oracle on specialized hardware. Period. If AMD had a “specialized” Barcelona in the plans, they wouldn’t have predicted a 70% increase over Opteron 2200—particularly not in a slide about OLTP using published TPC-C numbers from Opteron 2200 as the baseline. By the way, the only thing a 16MB cache helps with in an Oracle workload is Oracle’s code footprint. Everything else is load/store operations and cache invalidations. The AMD caches are generally too small for that footprint, but because the on-die memory controller comes with awesome memory latencies (due to Hypertransport), the small cache size hasn’t mattered that much with Opteron 800 and Socket F—but only in comparison to older Xeon offerings. This whole blog thread has been about today’s Xeons and future Barcelona though.

Large L2/L3 Cache Systems with OLTP

Regarding Tulsa Xeon processors used in the IBM System x TPC-C result of 331,087 TpmC, he writes:

the benchmark likely runs in cache on the special case hardware.

Cache-bound TPC-C? Yes, now I am convinced that his gut wasn’t telling him anything useful. I’ve been talking about TPC-C. He, being a server expert, must surely know that TPC-C cannot execute in cache. That Tulsa Xeon number at 331,087 TpmC was attached to 1,008 36.4GB hard drives in a TotalStorage SAN. Does that sound like cache to anyone?

Tomorrow’s Technology Compared to Today’s Technology
He did call for a new comparison that is worth consideration:

we all know the p4 architecture is on the way out and intel has even put an end of line date on the architecture. compare the barcelon to woodcrest

So I’ll reciprocate, gladly. Today’s Clovertown (2 Woodcrest processors essentially glued together) has a TPC-C performance of 222,117 TpmC as seen in this audited Woodcrest TPC-C result. Being a quad-core processor, the Oracle licensing is 2 licenses per socket. That means today’s Woodcrest performance is 55,529 TpmC per Oracle license compared to the projected Barcelona performance of 59,369 TpmC per Oracle license. That means if you wait for Barcelona you could get 7% more bang for your Oracle buck than you can with today’s shipping Xeon quad-core technology. And, like I said, since Barcelona is going to get plugged into a Socket F board, I’m not very hopeful that the processor will get the required complement of bandwidth to achieve that projected 70% increase over Opteron 2200.
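
For anyone checking my math on those per-license figures (both are 2-socket configurations, so 4 Oracle licenses each):

222,117 TpmC / 4 licenses = 55,529 TpmC per Oracle license (Clovertown, today)
139,693 TpmC x 1.70 = ~237,478 TpmC / 4 licenses = ~59,369 TpmC per Oracle license (Barcelona, projected)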

Now, isn’t this blogging stuff just a blast? And yes, unless AMD over-achieves on their current marketing projections for Barcelona performance, I’m going to be really bummed out.

Using Linux sched_setaffinity(2) To Bind Oracle Processes To CPUs

I have been exploring the effect of process migration between CPUs in a multi-core Linux system while running long-duration Oracle jobs. While Linux does schedule processes as best as possible for L2 cache affinity, I do see migrations on my HP DL585 Opteron 850 box. Cache affinity is important, and routine migrations can slow down long running jobs. In fact, when a process gets scheduled to run on a CPU different than the one it last ran on, the CPU will stall immediately while the cache is loaded with the process’ page tables—regardless of cache warmth. That is, the cache might have pages of text, data, stack and shared memory, but it won’t have the right versions of the page tables. Bear in mind that we are talking really small stalls here, but on long running jobs it can add up.

CPU_BIND
This Linux Journal webpage has the source for a program called cpu_bind that uses the Linux 2.6 sched_setaffinity(2) system call to establish hard affinity for a process to a specified CPU. I’ll be covering more of this in my NUMA series, but I thought I’d make a quick blog entry about this now to get the ball rolling.

After downloading the cpu_bind.c program, it is simple to compile and execute. The following session shows compilation and execution to set the PID of my current bash(1) shell to execute with hard affinity on CPU 3:

$ cc -o cpu_bind cpu_bind.c
$ cpu_bind $$ 3
$ while true
> do
> :
> done

The following is a screen shot of top(1) with CPU 3 utilized 100% in user mode by my looping shell. Note, you may have to right-click -> view image:

[Image: top1 - top(1) screen shot showing CPU 3 at 100% in user mode]

If you wanted to experiment with Oracle, you could start a long running job and execute cpu_bind on its PID once it is running, or do what I did with $$ and then invoke sqlplus, for instance. Also, a SQL*Net listener process could be started with hard affinity to a certain CPU and you could connect to it when running a long CPU-bound job. Just a thought, but I’ll be showing real numbers in my NUMA series soon.
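
If you would rather not pull down the Linux Journal code, the guts of such a program amount to just a few lines. A minimal sketch of my own (illustrative only, not the cpu_bind.c referenced above):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* usage: ./bind_sketch <pid> <cpu>   (pid 0 means the calling process) */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s pid cpu\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t) atoi(argv[1]);
    int cpu = atoi(argv[2]);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);

    /* Hard affinity: the scheduler will only run this PID on the given CPU. */
    if (sched_setaffinity(pid, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}

Build it with cc and use it the same way cpu_bind is used above.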

Give it a thought, see what you think.

The NUMA series links are:

Oracle on Opteron with Linux–The NUMA Angle (Part I)

Oracle on Opteron with Linux-The NUMA Angle (Part II)

Oracle on Opteron with Linux-The NUMA Angle (Part II)

A little more groundwork. Trust me, the Linux NUMA API discussion that is about to begin and the microbenchmark and Oracle benchmark tests will make a lot more sense with all this old boring stuff behind you.

Another Terminology Reminder
When discussing NUMA, the term node is not the same as in clusters. Remember that all the memory from all the nodes (or Quads, QBBs, RADs, etc) appear to all the processors as cache-coherent main memory.

More About NUMA Aware Software
As I mentioned in Oracle on Opteron with Linux–The NUMA Angle (Part I), NUMA awareness is a software term that refers to kernel and user mode software that makes intelligent decisions about how to best utilize resources in a NUMA system. I use the generic term resources because as I’ve pointed out, there is more to NUMA than just the non-uniform memory aspect. Yes, the acronym is Non Uniform Memory Access, but the architecture actually supports the notion of having building blocks with only processors and cache, only memory, or only I/O adaptors. It may sound really weird, but it is conceivable that a very specialized storage subsystem could be built and incorporated into a NUMA system by presenting itself as memory. Or, on the other hand, one could envision a very specialized memory component—no processors, just memory—that could be built into a NUMA system. For instance, think of a really large NVRAM device that presents itself as main memory in a NUMA system. That’s much different than an NVRAM card stuffed into something like a PCI bus and accessed with a device driver. Wouldn’t that be a great place to put an in-memory database for instance? Even a system crash would leave the contents in memory. Dealing with such topology requires the kernel to be aware of the differing memory topology that lies beneath it, and a robust user mode API so applications can allocate memory properly (you can’t just blindly malloc(3) yourself into that sort of thing). But alas, I digress since there is no such system commercially available. My intent was merely to expound on the architecture a bit in order to make the discussion of NUMA awareness more interesting.

In retrospect, these advanced NUMA topics are the reason I think Digital’s moniker for the building blocks used in the AlphaServer GS product line was the most appropriate. They used the acronym RAD (Resource Affinity Domain), which opens up the possible list of ingredients greatly. An API call would return RAD characteristics such as how many processors and how much memory (if any) a RAD consisted of, and so on. Great stuff. I wonder how that compares to the Linux NUMA API? Hmm, I guess I better get to blogging…

When it comes to the current state of “commodity NUMA” (e.g., Opteron and Itanium) there are no such exotic concepts. Basically, these systems have processors and memory “nodes” with varying latency due to locality—but I/O is equally costly for all processors. I’ll speak mostly of Opteron NUMA with Linux since that is what I deal with the most and that is where I have Oracle running.

For the really bored, here is a link to a AlphaServer GS320 diagram.

The following is a diagram of the Sequent NUMA-Q components that interfaced with the SHV Xeon chipset to make systems with up to 64 processors:

lynx1.jpg

OK, I promise, the next NUMA blog entry will get into the Linux NUMA API and what it means to Oracle.

Oracle on Opteron with Linux–The NUMA Angle (Part I)

There are Horrible Definitions of NUMA Out There on the Web
I want to start blogging about NUMA with regard to Oracle because NUMA has reached the commodity hardware scene with Opteron and Hypertransport technology. Yes, I know Opteron has been available for a long time, but it wasn’t until the Linux 2.6 kernel that there were legitimate claims of the OS being NUMA-aware. Before I can start blogging about NUMA/Oracle on Opteron related topics, I need to lay down some groundwork.

First, I’ll just come out and say it, I know NUMA—really, really well. I spent the latter half of the 1990’s inside the Sequent Port of Oracle working out NUMA-optimizations to exploit Sequent NUMA-Q 2000—the first commercially available NUMA system. Yes, Data General, SGI and Digital were soon to follow with AViiON, Origin 2000 and the AlphaServer GS320 respectively. The first port of Oracle to have code within the kernel specifically exploiting NUMA architecture was the Sequent port of Oracle8i.

 

Glossary
I’d like to offer a couple of quick definitions. The only NUMA that matters where Oracle is concerned is Cache Coherent NUMA (a.k.a CC-NUMA):

NUMA – A microprocessor-based computer system architecture comprised of compute nodes that possess processors and memory and usually disk/network I/O cards. A CC-NUMA system has specialized hardware that presents all the varying memory components as a single memory image to the processors. This has historically been accomplished with crossbar, switch or SCI ring technologies. In the case of Opteron, NUMA is built into the processor since each processor has an on-die memory controller. Understanding how a memory reference is satisfied in a NUMA system is the most important aspect of understanding NUMA. Each memory address referenced by the processors in a NUMA system is essentially “snooped” by the “NUMA memory controller” which in turn determines if the memory is local to the processor or remote. If remote, the NUMA “engine” must perform a fetch of the memory and install it into the requesting processor cache (which cache depends on the implementation although most have historically implemented an L3 cache for this remote-memory “staging”). The NUMA “engine” has to be keenly tuned to the processor’s capabilities since all memory-related operations have to be supported, including cache line invalidations and so forth. Implementations have varied wildly since the early 1990s. There have been NUMA systems that were comprised of complete systems linked by a NUMA engine. One such example was the Sequent NUMA-Q 2000 which was built on commodity Intel-based Pentium systems “chained” together by a very specialized piece of hardware that attached directly to each system bus. That specialized hardware was called the Lynx Card, which had an OBIC (Orion Bus Interface Controller) and a SCLIC (SCI Line Interface Controller) as well as 128MB L3 remote cache. On the Lynx card was a 510-pin GaAs ASIC that served as the “data pump” of the NUMA “engine”. These commodity NUMA “building blocks” were called “Quads” because they had 4 processors, local memory, local network and disk I/O adaptors—a lot of them. Digital referred to their physical building blocks as QBB (Quad Building Blocks) and logically (in their API for instance) as “RADs” for Resource Affinity Domains. In the case of Opteron, each processor is considered a “node” with only CPU and memory locality. With Opteron, network and disk I/O are uniform.

NUMA Aware – This term applies to software. NUMA-aware software is optimized for NUMA such that the topology is understood and runtime decisions can be made, such as what segment of memory to allocate from or what adaptor to perform I/O through. The latter, of course, does not apply to Opteron. NUMA awareness starts in the kernel, and with a NUMA API, applications too can be made NUMA aware. The Linux 2.6 Kernel has had NUMA awareness built in to a certain extent, and there has been a NUMA API available for just as long. Is the Kernel fully NUMA-optimized? Not by any stretch of the imagination. Is the API complete? No. Does that mean the Linux NUMA-related technology is worthless? That is what I intend to blog about.

Some of the good engineers that built NUMA-awareness into the Sequent NUMA-Q operating system—DYNIX/ptx—have contributed NUMA awareness to Linux through their work in the IBM Linux Technology Center. That is a good thing.

This thread on Opteron and Linux NUMA is going to be very Oracle-centric and will come out as a series of installments. But first, a trip down memory lane.

The NUMA Stink
In the year 2000, Sun was finishing a very anti-NUMA campaign. I remember vividly the job interview I had with Sun’s Performance, Availability and Architecture Engineering (PAE) Group led by Ganesh Ramamurthy. Those were really good guys, I enjoyed the interview, and I think I even regretted turning down their offer so I could instead work in the Veritas Database Editions Group on the Oracle Disk Manager Library. One of the prevailing themes during that interview was how hush, hush, wink, wink they were about using the term NUMA to describe forthcoming systems such as StarCat. That attitude even showed in the following Business Review Online article where the VP of Enterprise Systems at Sun in that time frame stated:

“We don’t think of the StarCat as a NUMA or COMA server,” he said. “This server has SMP latencies, and it is just a bigger, badder Starfire.”

No, it most certainly isn’t a COMA (although it did implement a few of the aspects of COMA) and it most certainly has always been a NUMA. Oops, I forgot to define COMA…next entry…and, oh, Opteron has made saying NUMA cool again!

 

