Archive for the 'Opteron NUMA' Category

You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part III.

By The Way, How Many NUMA Nodes Is Your AMD Opteron 6100-Based Server?

In my on-going series about Oracle Database 11g configuration for NUMA systems I’ve spoken of the enabling parameter and how it changed from _enable_NUMA_optimization (11.1) to _enable_NUMA_support (11.2). For convenience’s sake, I’ll point to the other two posts in the series for folks who care to catch up.

What does AMD Opteron 6100 (Magny-Cours) have to do with my on-going series on enabling/disabling NUMA features in Oracle Database? That’s a good question, but wouldn’t it be premature to simply presume each of these 12-core processors is a single NUMA node?

The AMD Opteron 6100 is a Multi-Chip Module (MCM). The “package” is two hex-core processors essentially “glued” together and placed into a socket. Each die has its own memory controller (hint, hint). I wonder what the Operating System sees in the case of a 4-socket server? Let’s take a peek.

The following is output from the numactl(8) command on a 4s48c Opteron 6100 (G34)-based server:

# numactl --hardware
available: 8 nodes (0-7)
node 0 size: 8060 MB
node 0 free: 7152 MB
node 1 size: 16160 MB
node 1 free: 16007 MB
node 2 size: 8080 MB
node 2 free: 8052 MB
node 3 size: 16160 MB
node 3 free: 15512 MB
node 4 size: 8080 MB
node 4 free: 8063 MB
node 5 size: 16160 MB
node 5 free: 15974 MB
node 6 size: 8080 MB
node 6 free: 8051 MB
node 7 size: 16160 MB
node 7 free: 15519 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  16  22  22  16
  2:  16  22  10  16  16  16  16  16
  3:  22  16  16  10  16  16  22  22
  4:  16  16  16  16  10  16  16  22
  5:  22  22  16  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  16  22  22  16  16  10

Heft
It wasn’t that long ago that an 8-node NUMA system was so large that a forklift was necessary to move it about (think Sequent, SGI, DG, DEC, etc.). Even much more recent 8-socket (thus 8 NUMA node) servers were a 2-man lift and quite large (e.g., the 7U HP Proliant DL785). These days, however, an 8-node NUMA system based on the AMD Opteron 6100 (G34) comes in a 2U package!

Is it time yet to stop thinking that NUMA is niche technology?

I’ll blog soon about booting Oracle to test NUMA optimizations on these 8-node servers.

You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part I.

In May 2009 I made a blog entry entitled You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part II. There was not yet a Part I, but as I pointed out in that post, I would loop back and write it. Here it is. Better late than never.

Background
I originally planned to use Part I to stroll down memory lane (back to 1995) with a story about the then VP of Oracle RDBMS Development’s initial impression of the Sequent DYNIX/ptx NUMA API, formed during a session where we presented the API and argued that it would be beneficial to code to NUMA APIs sooner rather than later. To be honest, we were mixing vision with the specific needs of our port.

We were the first to have a production NUMA API to which Oracle could port, and we were quite a bit earlier to the whole NUMA trend than anyone else. Ours was the first production NUMA system.

Now, this VP is no longer at Oracle but the (redacted) response was, “Why would we want to use any of this ^#$%.” We (me and the three others presenting the API) were caught off guard. However, we all knew that the question was a really good one. There were still good companies making really tight, high-end SMPs with uniform memory. Just because we (Sequent) had to move into NUMA architecture didn’t mean we were blind to the reality around us. However, one thing we knew for sure—all systems in the future would have NUMA attributes of varying degrees. All our competition was either in varying stages of denial or doing what I like to refer to as “poo-pooh it while you do it.” All the major players eventually came out with NUMA systems. Some sooner, some later, and others died trying.

That takes us to Commodity NUMA and the new purpose of this “Part I” post.

Before I say a word about this Part I, I’d like to point out that the concepts in Part II are of a “must-know” variety, unless you relinquish your computing power to some sort of hosted facility where you don’t have the luxury of caring about the architecture upon which you run Oracle Database.

Part II was about the different types of NUMA (historical and present) and such knowledge will help you if you find yourself in a troubling performance situation that relates to NUMA. NUMA is commodity, as I point out, and we have to come to grips with that.

What Is He Blogging About?
The current state of commodity NUMA is very peculiar. These Commodity NUMA Implementation (CNI) systems are so tightly coupled that most folks don’t even realize they are running on a NUMA system. In fact, let me go out on a ledge. I assert that nobody is configuring Oracle Database 11g Release 2 with NUMA optimizations in spite of the fact that they are on a NUMA box (e.g., Nehalem EP, AMD Opteron). The reason I believe this is that the init.ora parameter to invoke Oracle NUMA awareness changed names from 11gR1 to 11gR2 as per My Oracle Support note 864633.1. The parameter changed from _enable_NUMA_optimization to _enable_NUMA_support. I know nobody is setting this because if they had, I can almost guarantee they would have googled for problems. Allow me to explain.

If Nobody is Googling It, Nobody is Doing It
Anyone who tests _enable_NUMA_support as per My Oracle Support note 864633.1 will likely experience the sorts of problems that I detail later in this post. But first, let’s see what they would get from google when they search for _enable_NUMA_support:

Yes, just as I thought…Google found nothing. But what is my point? My point is two-fold. First, I happen to know that Nehalem EP with QPI and Opteron with AMD HyperTransport are such good technologies that you really don’t have to care that much about NUMA software optimizations—at least not at this point of the game. Reading M.O.S note 1053332.1 (regarding disabling Linux NUMA support for Oracle Database Machine hosts) sort of drives that point home. However, saying you don’t need to care about NUMA doesn’t mean you shouldn’t experiment. How can anyone say that setting _enable_NUMA_support is a total placebo in all cases? One can’t prove a negative.

If you dare, trust me when I say that an understanding of NUMA will be as essential in the next 10 years as understanding SMP (parallelism and concurrency) was in the last 20 years. OK, off my soapbox.

Some Lessons in Enabling Oracle NUMA Optimizations with Oracle Database 11g Release 2
This section of the blog aims to point out that even when you think you might have tested Oracle NUMA optimizations, there is a chance you didn’t. You have to know how to ensure NUMA optimizations are actually in play. Why? Well, if the configuration is not right for enabling NUMA features, Oracle Database will simply ignore you. Consider the following session, in which I demonstrate:

  1. Evidence that I am on a NUMA system (numactl(8))
  2. I started up an instance with a pfile (p4.ora) that has _enable_NUMA_support set to TRUE
  3. The instance started but _enable_NUMA_support was forced back to FALSE

Note: in spite of item #3, the alert log will not report anything about what went wrong.

SQL>
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 36317 MB
node 0 free: 31761 MB
node 1 size: 36360 MB
node 1 free: 35425 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

SQL> startup pfile=./p4.ora
ORACLE instance started.

Total System Global Area 5746786304 bytes
Fixed Size                  2213216 bytes
Variable Size            1207962272 bytes
Database Buffers         4294967296 bytes
Redo Buffers              241643520 bytes
Database mounted.
Database opened.
SQL> show parameter _enable_NUMA_support

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_enable_NUMA_support                 boolean     FALSE

SQL>
SQL> !grep _enable_NUMA_support ./p4.ora
_enable_NUMA_support=TRUE

OK, so the instance is up and the parameter was reverted. What do the IPC shared memory segments look like?

SQL> !ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0xed304ac0 229380     oracle    660        4096       0
0x7393f7f4 1179653    oracle    660        5773459456 35
0x00000000 393223     oracle    644        790528     5          dest
0x00000000 425992     oracle    644        790528     5          dest
0x00000000 458761     oracle    644        790528     5          dest

Right, so I have no NUMA placement of the buffer pool. On Linux, Oracle must create multiple segments and allocate them on specific NUMA nodes (memory hierarchies). It was a little simpler for the first NUMA-aware port of Oracle (Sequent) since the APIs allowed for the creation of a single shared memory segment with regions of the segment placed onto different memories. Ho Hum.
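If you want to double-check where a segment’s pages actually landed, newer kernels expose per-node page counts in /proc. A sketch, assuming a kernel recent enough to have numa_maps (RHEL4’s 2.6.9 predates it) and a hypothetical server process PID of 12345:

$ pgrep -f ora_dbw0 | head -1
12345
$ grep -i sysv /proc/12345/numa_maps

Each SYSV shared memory line shows the mapping’s memory policy along with N0=/N1= page counts per node, which makes Cyclops-style placement obvious at a glance.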

What Went Wrong
Oracle could not find the libnuma.so it wanted to link with dlopen():

$ grep libnuma /tmp/strace.out | grep ENOENT | head
14626 open("/usr/lib64/libnuma.so", O_RDONLY) = -1 ENOENT (No such file or directory)
14627 open("/usr/lib64/libnuma.so", O_RDONLY) = -1 ENOENT (No such file or directory)

So I created the necessary symbolic link, booted the instance again, and inspected the shared memory segments. Here I see that I have a ~1GB segment for the variable SGA components, and my buffer pool has been split into two segments of roughly 2.3GB and 2.1GB—one per NUMA node.
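For the record, the fix is exactly the link the strace output begs for (as root):

# ln -s /usr/lib64/libnuma.so.1 /usr/lib64/libnuma.so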

# ls -l /usr/*64*/*numa*
lrwxrwxrwx 1 root root    23 Mar 17 09:25 /usr/lib64/libnuma.so -> /usr/lib64/libnuma.so.1
-rwxr-xr-x 1 root root 21752 Jul  7  2009 /usr/lib64/libnuma.so.1

SQL> show parameter db_cache_size

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_cache_size                        big integer 4G
SQL> show parameter NUMA_support

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_enable_NUMA_support                 boolean     TRUE
SQL> !ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0xed304ac0 229380     oracle    660        4096       0
0x00000000 2719749    oracle    660        1006632960 35
0x00000000 2752518    oracle    660        2483027968 35
0x00000000 393223     oracle    644        790528     6          dest
0x00000000 425992     oracle    644        790528     6          dest
0x00000000 458761     oracle    644        790528     6          dest
0x00000000 2785290    oracle    660        2281701376 35
0x7393f7f4 2818059    oracle    660        2097152    35

So there I have an SGA successfully created with _enable_NUMA_support set to TRUE. But, what strings appear in the alert log? Well, I’ll blog that soon because it leads me to other content.

Oracle11g Automatic Memory Management – Part III. A NUMA Issue.

Now I’m glad I did that series about Oracle on Linux, The NUMA Angle. In my post about the difference between NUMA and SUMA and “Cyclops”, I shared a lot of information about the dynamics of Oracle running with all the SGA allocated from one memory bank on a NUMA system. Déjà vu.

Well, we’re at it again. As I point out in Part I and Part II of this series, Oracle implements Automatic Memory Management in Oracle Database 11g with memory mapped files in /dev/shm. That got me curious.

Since I exclusively install my Oracle bits on NFS mounts, I thought I’d sling my 11g ORACLE_HOME over to a DL385 I have available in my lab setup. Oh boy am I going to miss that lab when I take on my new job September 4th. Sob, sob. See, when you install Oracle on NFS mounts, the installation is portable. I install 32bit Linux ports via a 32bit server into an NFS mount and I can take it anywhere. In fact, since the database is on an NFS mount (HP EFS Clustered Gateway NAS) I can take the ORACLE_HOME and the database mounts to any system running a RHEL4 OS, and that includes RHEL4 x86_64 servers even though the ORACLE_HOME is 32bit. That works fine, except 32bit Oracle cannot use libaio on 64bit RHEL4 (unless you invoke everything under the linux32 command environment, that is). I don’t care about that since I use either Oracle Disk Manager or, better yet, Oracle11g Direct NFS. Note, running 32bit Oracle on a 64bit Linux OS is not supported for production, but for my case it helps me check certain things out. That brings us back to /dev/shm on AMD Opteron (NUMA) systems. It turns out the only Opteron system I could test 11g AMM on happens to have x86_64 RHEL4 installed, but again, no matter.
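As an aside, the linux32 workaround mentioned above is just a personality wrapper; prefix a command with it and the kernel reports a 32bit architecture to that command and its children. A quick sketch:

$ uname -m
x86_64
$ linux32 uname -m
i686
$ linux32 sqlplus '/ as sysdba'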

Quick Test

[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 3955 MB
[root@tmr6s5 ~]# dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024
1024+0 records in
1024+0 records out
[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 2927 MB

Uh, that’s not good. I dumped some zeros into a file on /dev/shm and all the memory was allocated from socket 1. Lest anyone forget from my NUMA series (you did read that, didn’t you?), writing memory not connected to your processor is, uh, slower:

[root@tmr6s5 ~]# taskset -pc 0-1 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 0,1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m1.116s
user    0m0.005s
sys     0m1.111s
[root@tmr6s5 ~]# taskset -pc 1-2 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m0.931s
user    0m0.006s
sys     0m0.923s

Yes, 20% slower.

What About Oracle?
So, like I said, I mounted that ORACLE_HOME on this Opteron server. What does an AMM instance look like? Here goes:

SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3587 MB
node 1 size: 4095 MB
node 1 free: 3956 MB
SQL> startup pfile=./amm.ora
ORACLE instance started.

Total System Global Area 2276634624 bytes
Fixed Size                  1300068 bytes
Variable Size             570427804 bytes
Database Buffers         1694498816 bytes
Redo Buffers               10407936 bytes
Database mounted.
Database opened.
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 1331 MB
node 1 size: 4095 MB
node 1 free: 3951 MB

Ick. This means that Oracle11g AMM on Opteron servers is a Cyclops. Odd how this allocation came from memory attached to socket 0 when the file creation with dd(1) landed in socket 1’s memory. Hmm…

What to do? SUMA? Well, it seems as though I should be able to interleave tmpfs memory and use that for /dev/shm, at least according to the tmpfs documentation. And should is the operative word. I have been tweaking for a half hour to get the mpol=interleave mount option working (with and without the -o remount technique) to no avail. Bummer!
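For anyone who wants to fiddle along, the form I was attempting, straight from the tmpfs documentation, is along these lines (no guarantee it behaves on your kernel, since it clearly didn’t on mine):

# mount -o remount,mpol=interleave /dev/shm
# mount | grep shm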

Impact
If AMD can’t get the Barcelona and/or Budapest Quad-core off the ground (and into high-quality servers from HP/IBM/DELL/Verari), none of this will matter. Actually, come to think of it, unless Barcelona is really, really fast, you won’t be sticking it into your existing Socket F motherboards because that doubles your Oracle license fee (unless you are on standard edition which is priced on socket count). That leaves AMD Quad-core adopters waiting for HyperTransport 3.0 as a remedy. I blogged all this AMD Barcelona stuff already.

Given the NUMA characteristics of /dev/shm, I think I’ll test AMM versus MMM on NUMA, and then test again on SUMA, if I can find the time.

If anyone can get /dev/shm mounted with the mpol option, please let me know because, at times, I can be quite a dolt and I’d love this to be one of those times.

Oracle on Opteron with Linux-The NUMA Angle (Part VII).

This installment in my series about Oracle on Linux with NUMA hardware is very, very late. I started this series at the end of last year and it just kept getting put off—mostly because the hardware I needed to use was being used for other projects (my own projects). This is the seventh in the series and it’s time to show some Oracle numbers. Previously, I laid groundwork about such topics as SUMA/NUMA, NUMA API and so forth. To make those points I relied on microbenchmarks such as the Silly Little Benchmark. The previous installments can be found here.

To bring home the point that Oracle should be run on AMD boxes in NUMA mode (as opposed to SUMA), I decided to pick an Oracle workload that is very easy to understand as well as processor intensive. After all, the difference between SUMA and NUMA is higher memory latency, so testing at any level below processor saturation actually provides the same throughput, albeit at a higher processor cost in the SUMA case. To that end, measuring SUMA and NUMA at processor saturation is the best way to see the difference.

The workload I’ll use for this testing is what my friend Anjo Kolk refers to as the Jonathan Lewis Oracle Computing Index workload. The workload comes in script form and is very straightforward. The important thing about the workload is that it hammers memory which, of course, is the best way to see the NUMA effect. Jonathan Lewis needs no introduction of course.

The test was set up to execute 4, 8, 16 and 32 concurrent invocations of the JL Comp script. The only difference in the test setup was that in one case I booted the server in SUMA mode and in the other I booted in NUMA mode and allocated hugepages. As I point out in this post about SUMA, hugepages are allocated in a NUMA fashion, and booting an SGA into this memory offers at least crude fairness in the placement of the SGA pages—certainly much better than a Cyclops. In short, what is being tested here is one case where memory is allocated at boot time in a completely round-robin fashion versus one where the SGA is quasi-round-robin yet page tables, kernel-side process-related structures and heap are all NUMA-optimized. Remember, this is no more difficult than a system boot option. Let’s get to the numbers.

[Figure: jlcomp.jpg (JL Comp results contrasting SUMA and NUMA)]

I have also rolled up all the statspack reports into a Word document (as required by WordPress). The document is numa-statspack.doc and it consists of 8 statspack reports, each prefaced by the name of the specific test. If you pattern search for REPORT NAME you will see each entry. Since this is a simple memory latency improvement, you might not be surprised by how uninteresting the stats are, except of course the vast improvement in the number of logical reads per second the NUMA tests were able to push through the system.

SUMA or NUMA
A picture speaks a thousand words. This simple test combined with this simple graph covers it all pretty well. The job completion time ranged from about 12 to 15 percent better with NUMA at each of the concurrent session counts. While 12 to 15% isn’t astounding, remember this workload is completely processor bound. How do you usually recoup 12-15% from a totally processor-bound workload without changing even a single line of code? Besides, this is only one workload, and the fact remains that the more your particular workload does outside the SGA (e.g., sorting), the more likely you are to see improvement. But by all means, do not run Oracle with Cyclops memory.

The Moral of the Story

Processors are going to get more cores and slower clock rates and memory topologies will look a lot more NUMA than SUMA as time progresses. I think it is important to understand NUMA.

What is Oracle Doing About It?
Well, I’ve blogged about the fact that the Linux ports of 10g do not integrate with libnuma. That means it is not NUMA-aware. What I’ve tried to show in this series is that the world of NUMA is not binary. There is more to it than SUMA or NUMA-aware. In the middle is booting the server and database in a fashion that at least allows benefit from the OS-side NUMA-awareness. The next step is Oracle NUMA-awareness.

Just recently I was sitting in a developer’s office in bldg 400 of Oracle HQ talking about NUMA. It was a good conversation. He stated that Oracle actually has NUMA awareness in it and I said, “I know.” I don’t think Sequent was on his mind and I can’t blame him—that was a long time ago. The vestiges of NUMA awareness in Oracle 10g trace back to the high-end proprietary NUMA implementations of the 1990s. So if “it’s in there,” what’s missing? We both said vgetcpu() at the same time. You see, you can’t have Oracle making runtime decisions about local versus remote memory if a process doesn’t know what CPU it is currently executing on (detectable in less than a handful of instructions). Things like vgetcpu() seem to be coming along. That means once these APIs are fully baked, I think we’ll see Oracle resurrect intrinsic NUMA awareness in the Linux port of Oracle Database akin to those wildcat ports of the late 90s…and that should be a good thing.

Oracle on Opteron with Linux-The NUMA Angle (Part VI). Introducing Cyclops.

This is part 6 in a series about Oracle on Opteron-based NUMA servers running Linux. The list of prior installments can be found through my index of NUMA-related posts.

In part 5 of the series I discussed using Opteron-based servers with NUMA features disabled in the BIOS. Running an Opteron server (e.g., HP Proliant DL585) in this fashion is sometimes called SUMA (Sufficiently Uniform Memory Access) or SUMO (Sufficiently Uniform Memory Organization). At the risk of being controversial, I pointed out that in the Oracle Validated Configuration listing for Proliant, the recommendation is given to configure Opteron-based servers as SUMO/SUMA. In my experience, most folks do not change the BIOS and are therefore running a NUMA system since that is the default. However, if steps are taken to disable NUMA on an Opteron system, there are subtleties that warrant deeper understanding. How subtle are the subtleties? That question is the main theme of this blog series.

Memory Latencies with SUMA/SUMO vs NUMA
In part 5 of the series, I used the SLB memory latency workload to show how memory writes differ in NUMA versus SUMA/SUMO. I wrote:

Writing memory on the SUMA configuration in the 8 concurrent memhammer case demonstrated latencies on order of 156ns but dropped 38% to 97ns by switching to NUMA and using the Linux 2.6 NUMA API.

But What About Oracle?
What is the cost of running Oracle on SUMA? The simple answer is, it depends. More architectural background is needed before I go into that.

SUMA, NUMA and CYCLOPS
OK, so SUMA is what you get when you tweak a Proliant Opteron-based server so that memory is interleaved at the low level. Accompanying this with the setting of numa=off in the grub.conf file gets you a completely non-NUMA setup.
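For reference, that amounts to a grub.conf kernel line along these lines (kernel and root device names are illustrative, borrowed from my RHEL4 systems):

kernel /vmlinuz-2.6.9-34.ELsmp ro root=/dev/sda2 elevator=deadline numa=off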

Cyclops
NUMA enabled in the BIOS, however, is the default. If the Oracle ports to Linux were NUMA-aware, that would be just fine. However, if the server isn’t configured as a SUMA and you boot Oracle without any consideration for the fact that you are on a NUMA system, you get what I call Cyclops. Let’s take a look at what I mean.

In the following screen shot I have booted an Oracle10g SGA of 7584MB on my Proliant DL585. The system is configured with 32GB physical memory which is, of course, 4 banks of 8GB each attached to one of the 4 dual-core Opterons (nodes). Before booting this SGA, I had between roughly 7.6GB and 7.7GB free memory on each of the memory banks. In the following figure it’s clear that after booting this 7584MB SGA I am left with all but 116MB of memory consumed from node 0 (socket 0)—Cyclops!


[Screenshot: cyclops1 (per-node free memory after booting the 7584MB SGA)]

Right, so really bad things can happen if processes that are memory-resident on node 0 try to allocate more memory. In the 2.4 kernel timeframe, Red Hat pointed out such ill effects as OOM process termination in this web page. I haven’t spent much time researching how 2.6 responds to it because the point of this blog entry is to not get into such a situation.

Let’s consider what things are like on a Cyclops even if there are no process or memory allocation failures. Let’s say, for instance, there is a listener with soft node affinity to node 2. All the sessions it forks off will have node affinity to node 2, where they will be granted pages for their kernel structures, page tables, stack, heap and so on. However, the entire SGA is remote memory since, as you can see, all the memory for the SGA was allocated from node 0. That is, um, not good.

Hugepages Are More Attractive Than Cyclops
Cyclops pops up its ugly single-eyed head only when you are running NUMA (not SUMA/SUMO) and fail to allocate/use hugepages. Whether you allocate hugepages off the grub boot line or out of sysctl.conf, memory for hugepages is allocated in a distributed fashion from the varying memory banks. Did I say round-robin? No, because I don’t yet know whether it is round-robin or segmented. I have to leave something to blog about in the future.
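The mechanics of both methods, for reference (the count of 3800 2MB hugepages matches the example below; kernel and root device names are illustrative). Off the grub boot line:

kernel /vmlinuz-2.6.9-34.ELsmp ro root=/dev/sda2 hugepages=3800

Or out of sysctl.conf:

# echo "vm.nr_hugepages = 3800" >> /etc/sysctl.conf
# sysctl -p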

The following is a screen shot of a session where I allocated 3800 2MB hugepages after the system was booted by echoing that value into /proc/sys/vm/nr_hugepages. Notice that unlike Cyclops, the pages are allocated for Oracle’s future use in a more distributed fashion from the various memory banks. I then booted Oracle. No Cyclops here.

[Screenshot: hugepages (per-node free memory after allocating 3800 2MB hugepages)]

Interleaving NUMA Memory Allocation
The numactl(8) command supports the notion of pushing memory allocation preferences down to its children. Until such time as the Linux port of Oracle is NUMA-aware internally—as was done in the Sequent DYNIX/ptx, SGI, and DG ports, and to a lesser degree the Solaris Oracle10g port with MPO—the best hope for efficient memory usage on a commodity NUMA system is to interleave the placement of shared memory via numactl(8). With the SGA allocated in this fashion on a 4-socket NUMA system, Oracle’s memory accesses for the variable and buffer pool components will have locality of roughly 25%, generally speaking. Yes, I’m sure some session could go crazy with logical reads of 2 buffers 20,000 times per second or some pathological situation, but I am trying to cover the topic in more general terms. You might wonder how this differs from SUMA/SUMO though.

With SUMA, all memory is interleaved. That means even the NUMA-aware Linux 2.6 kernel cannot exploit the hardware architecture by allocating structures with respect to the memory hierarchies. That is a pure waste. Moreover, with SUMA, 100% of your Oracle memory accesses will hit interleaved memory. That includes the PGA. In contrast, properly allocated NUMA-interleaved hugepages result in fairness in the SGA placement, while allocations in the PGA (heap) and stack for the sessions are 100% local memory! That is a good thing. In the following screen shot I coupled numactl(8) memory interleaving with hugepages.

[Screenshot: interleave (numactl(8) interleaving coupled with hugepages)]
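In case the screen shot is hard to make out, the gist of the technique is to start the instance under an interleaved policy so that everything sqlplus spawns inherits it. A sketch:

$ numactl --interleave=all sqlplus '/ as sysdba'
SQL> startup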

Validated Oracle Configuration
As I pointed out, this Oracle Validated Configuration listing for Proliant recommends turning off NUMA. Now that I’m an HP employee, I’ll have to pursue that a bit because I don’t agree with it at all. You’ll see why when I post my performance measurements contrasting NUMA (with interleaved hugepages) to SUMA/SUMO. Look at that Validated Configuration web page closely and you’ll see a recommendation to allow Oracle to use hugepages by tuning /etc/security/limits.conf, but no allocation of hugepages either from the grub boot line or via the sysctl.conf file!
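For reference, the limits.conf piece they do call out is just the memlock entries, something like the following (values in KB are illustrative and must be at least as large as the hugepage-backed SGA):

oracle soft memlock 8388608
oracle hard memlock 8388608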

Could it be that the recommendations in this Validated Configuration were a knee-jerk reaction to Cyclops? I’m not much of a betting man, but I’d wager $5.00 that was the case. Like I said, I’m in HP now…I’ll have to see what all that was about.

Up Next
In my next installment, I will provide Oracle measurements contrasting SUMA and NUMA. I know I’ve said this would be the installment with Oracle performance numbers, but I had to lay too much groundwork in this post. The mind can only absorb what the seat can endure.

Patent Infringement
For all you folks that hate the concept of software patents, here’s a good one. When my Sequent colleagues and I were working out the OS requirements to support our NUMA optimizations of the Oracle 8 port to Sequent’s NUMA-Q system, we knew early on we’d need a very rich set of enhancements to shmget() for memory region placement. So we specified the requirements to our OS developers. Lo and behold, U.S. Patent 6,505,286 plopped out. So, for extra credit, can someone explain to me how the Linux 2.6 libnuma call numa_alloc_onnode() (described here) is not in complete violation of that patent? Hmmm…

Now for a real taste of NUMA-Oracle history, read the following: Sequent_NUMA_Oracle8i

Oracle on Opteron with Linux-The NUMA Angle (Part V). Introducing numactl(8) and SUMA. Is The Oracle x86_64 Linux Port NUMA Aware?

This blog entry is part five in a series. Please visit here for links to the previous installments.

Opteron-Based Servers are NUMA Systems
Or are they? It depends on how you boot them. For instance, I have 2 HP DL585 servers clustered with the PolyServe Database Utility for Oracle RAC. I booted one of the servers as a non-NUMA system by tweaking the BIOS so that memory was interleaved on a 4KB basis. This is a memory model HP calls Sufficiently Uniform Memory Access (SUMA), as stated in this DL585 Technology Brief (pg. 6):

Node interleaving (SUMA) breaks memory into 4-KB addressable entities. Addressing starts with address 0 on node 0 and sequentially assigns through address 4095 to node 0, addresses 4096 through 8191 to node 1, addresses 8192 through 12287 to node 2, addresses 12288 through 16383 to node 3, and so on.

Booting in this fashion essentially turns an HP DL585 into a “flat-memory” SMP—or a SUMA in HP parlance. There seem to be conflicting monikers for Opteron SMPs used in this mode. IBM has a Redbook that covers the varying NUMA offerings in their System x portfolio. The abstract for this Redbook states:

The AMD Opteron implementation is called Sufficiently Uniform Memory Organization (SUMO) and is also a NUMA architecture. In the case of the Opteron, each processor has its own “local” memory with low latency. Every CPU can also access the memory of any other CPU in the system but at longer latency.

Whether it is SUMA or SUMO, the concept is cool, but a bit foreign to me given my NUMA background. The NUMA systems I worked on in the 90s consisted of distinct, separate small systems—each with their own memory and I/O cards, power supplies and so on. They were coupled into a single shared memory image with specialized hardware inserted into the system bus of each little system. These cards were linked together and the whole package was a cache coherent SMP (ccNUMA).

Is SUMA Recommended For Oracle?
Since the HP DL585 can be SUMA/SUMO, I thought I’d give it a test. But first I did a little research to see how most folks use these in the field. I know from the BIOS on my system that you actually get a warning and have to override it when setting up interleaved memory (SUMA). I also noticed that in one of HP’s Oracle Validated Configurations, the following statement is made:

Settings in the server BIOS adjusted to allow memory/node interleaving to work better with the ‘numa=off’ boot option

and:

Boot options
elevator=deadline numa=off


I found this to be strange, but I don’t yet fully understand why that recommendation is made. Why did they perform this validation with SUMA? When running a 4-socket Opteron system in SUMA mode, only 25% of all memory accesses will be to local memory. When I say all, I mean all—both user and kernel mode. The Linux 2.6 kernel is NUMA-aware, so it seems like a waste to transform a NUMA system into a SUMA system. How can boiling down a NUMA system with interleaving (SUMA) possibly be optimal for Oracle? I will blog about this more as this series continues.

Is the x86_64 Linux Oracle Port NUMA Aware?
No, sorry, it is not. I might as well just come out and say it.

The NUMA API for Linux is very rudimentary compared to the boutique features in legacy NUMA systems like Sequent DYNIX/ptx and SGI IRIX, but it does support memory and process placement. I’ll blog later about the things it is missing that a NUMA-aware Oracle port would require.

The Linux 2.6 kernel is NUMA-aware, but what is there for applications? The NUMA API, which is implemented in the library called libnuma.so. But you don’t have to code to the API to get NUMA awareness. The major 2.6 Linux kernel distributions (RHEL4 and SLES) ship with a command that uses the NUMA API in ways I’ll show later in this blog entry. The command is numactl(8) and it dynamically links to the NUMA API library (emphasis added by me):

$ uname -a
Linux tmr6s13 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ type numactl
numactl is hashed (/usr/bin/numactl)
$ ldd /usr/bin/numactl
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003ba3200000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003ba2f00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003ba2d00000)

Whereas the numactl(8) command links with libnuma.so, Oracle does not:

$ type oracle
oracle is /u01/app/oracle/product/10.2.0/db_1/bin/oracle
$ ldd /u01/app/oracle/product/10.2.0/db_1/bin/oracle
libskgxp10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libskgxp10.so (0x0000002a95557000)
libhasgen10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libhasgen10.so (0x0000002a9565a000)
libskgxn2.so => /u01/app/oracle/product/10.2.0/db_1/lib/libskgxn2.so (0x0000002a9584d000)
libocr10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libocr10.so (0x0000002a9594f000)
libocrb10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libocrb10.so (0x0000002a95ab4000)
libocrutl10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libocrutl10.so (0x0000002a95bf0000)
libjox10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libjox10.so (0x0000002a95d65000)
libclsra10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libclsra10.so (0x0000002a96830000)
libdbcfg10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libdbcfg10.so (0x0000002a96938000)
libnnz10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libnnz10.so (0x0000002a96a55000)
libaio.so.1 => /usr/lib64/libaio.so.1 (0x0000002a96f15000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003ba3200000)
libm.so.6 => /lib64/tls/libm.so.6 (0x0000003ba3400000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003ba3800000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003ba7300000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003ba2f00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003ba2d00000)

No Big Deal, Right?
This NUMA stuff must just be a farce then, right? Let’s dig in. First, I’ll use the SLB (http://oaktable.net/getFile/148). Later I’ll move on to what fellow OakTable Network member Anjo Kolk and I refer to as the Jonathan Lewis Oracle Computing Index. The JL Oracle Computing Index is yet another microbenchmark that is very easy to run and makes it easy to compare memory throughput from one server to another using an Oracle workload. I’ll use it next to blog about NUMA effects on a running instance of Oracle. After that I’ll move on to more robust Oracle OLTP and DSS workloads. But first, more SLB.

The SLB on SUMA/SUMO
First, let’s use the numactl(8) command to see what this DL585 looks like. Is it NUMA or SUMA?

$ uname -a
Linux tmr6s13 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ numactl --hardware
available: 1 nodes (0-0)
node 0 size: 32767 MB
node 0 free: 30640 MB

OK, this is a single-node NUMA system—or a SUMA, since it was booted with memory interleaving on. If it weren’t for that boot option the command would report memory for all 4 “nodes” (nodes are sockets in the Opteron NUMA world). So, I set up a series of SLB tests as follows:

$ cat example1
echo "One thread"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./trigger
wait

echo "Two threads, same socket"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 6
./memhammer 262144 6000 &
./trigger
wait

echo "Two threads, different sockets"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./trigger
wait

echo "4 threads, 4 sockets"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./cpu_bind $$ 3
./memhammer 262144 6000 &
./cpu_bind $$ 1
./memhammer 262144 6000 &
./trigger
wait

echo "8 threads, 4 sockets"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./cpu_bind $$ 3
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./cpu_bind $$ 1
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./trigger
wait

And now the measurements:

$ sh ./example1
One thread
Total ops 1572864000 Avg nsec/op 71.5 gettimeofday usec 112433955 TPUT ops/sec 13989225.9
Two threads, same socket
Total ops 1572864000 Avg nsec/op 73.4 gettimeofday usec 115428009 TPUT ops/sec 13626363.4
Total ops 1572864000 Avg nsec/op 74.2 gettimeofday usec 116740373 TPUT ops/sec 13473179.5
Two threads, different sockets
Total ops 1572864000 Avg nsec/op 73.0 gettimeofday usec 114759102 TPUT ops/sec 13705788.7
Total ops 1572864000 Avg nsec/op 73.0 gettimeofday usec 114853095 TPUT ops/sec 13694572.2
4 threads, 4 sockets
Total ops 1572864000 Avg nsec/op 78.1 gettimeofday usec 122879394 TPUT ops/sec 12800063.1
Total ops 1572864000 Avg nsec/op 78.1 gettimeofday usec 122820373 TPUT ops/sec 12806214.2
Total ops 1572864000 Avg nsec/op 78.2 gettimeofday usec 123016921 TPUT ops/sec 12785753.3
Total ops 1572864000 Avg nsec/op 78.5 gettimeofday usec 123527864 TPUT ops/sec 12732868.1
8 threads, 4 sockets
Total ops 1572864000 Avg nsec/op 156.3 gettimeofday usec 245773200 TPUT ops/sec 6399656.3
Total ops 1572864000 Avg nsec/op 156.3 gettimeofday usec 245848989 TPUT ops/sec 6397683.4
Total ops 1572864000 Avg nsec/op 156.4 gettimeofday usec 245941009 TPUT ops/sec 6395289.7
Total ops 1572864000 Avg nsec/op 156.4 gettimeofday usec 246000176 TPUT ops/sec 6393751.5
Total ops 1572864000 Avg nsec/op 156.6 gettimeofday usec 246262366 TPUT ops/sec 6386944.2
Total ops 1572864000 Avg nsec/op 156.5 gettimeofday usec 246221624 TPUT ops/sec 6388001.1
Total ops 1572864000 Avg nsec/op 156.7 gettimeofday usec 246402465 TPUT ops/sec 6383312.8
Total ops 1572864000 Avg nsec/op 156.8 gettimeofday usec 246594031 TPUT ops/sec 6378353.9

SUMA baselines at 71.5ns average write operation and tops out at about 156ns with 8 concurrent threads of SLB execution (one per core). Let’s see what SLB on NUMA does.

SLB on NUMA
First, let’s get an idea what the memory layout is like:

$ uname -a
Linux tmr6s14 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ numactl --hardware
available: 4 nodes (0-3)
node 0 size: 8191 MB
node 0 free: 5526 MB
node 1 size: 8191 MB
node 1 free: 6973 MB
node 2 size: 8191 MB
node 2 free: 7841 MB
node 3 size: 8191 MB
node 3 free: 7707 MB

OK, this means that there is approximately 5.5GB, 6.9GB, 7.8GB and 7.7GB of free memory on “nodes” 0, 1, 2 and 3 respectively. Why is the first node (node 0) lop-sided? I’ll tell you in the next blog entry. Let’s run some SLB. First, I’ll use numactl(8) to invoke memhammer with the directive that forces allocation of memory on a node-local basis. The first test is one memhammer process per socket:

$ cat ./membind_example.4
./create_sem
numactl --membind 3 --cpubind 3 ./memhammer 262144 6000 &
numactl --membind 2 --cpubind 2 ./memhammer 262144 6000 &
numactl --membind 1 --cpubind 1 ./memhammer 262144 6000 &
numactl --membind 0 --cpubind 0 ./memhammer 262144 6000 &
./trigger
wait

$ bash ./membind_example.4
Total ops 1572864000 Avg nsec/op 67.5 gettimeofday usec 106113673 TPUT ops/sec 14822444.2
Total ops 1572864000 Avg nsec/op 67.6 gettimeofday usec 106332351 TPUT ops/sec 14791961.1
Total ops 1572864000 Avg nsec/op 68.4 gettimeofday usec 107661537 TPUT ops/sec 14609340.0
Total ops 1572864000 Avg nsec/op 69.7 gettimeofday usec 109591100 TPUT ops/sec 14352114.4

This test is the same as the one above called “4 threads, 4 sockets” performed on the SUMA configuration, where the latencies were 78ns. Switching from SUMA to NUMA and executing with NUMA placement brought the latencies down 13% to an average of 68ns. Interesting. Moreover, this test with 4 concurrent memhammer processes actually demonstrates better latencies than the single-process average on SUMA, which was 71.5ns. That comparison alone is quite interesting because it makes the point quite clear that SUMA in a 4-socket system is a 75% remote memory configuration—even for a single process like memhammer.

The next test was 2 memhammer processes per socket:

$ more membind_example.8
./create_sem
numactl --membind 3 --cpubind 3 ./memhammer 262144 6000 &
numactl --membind 3 --cpubind 3 ./memhammer 262144 6000 &
numactl --membind 2 --cpubind 2 ./memhammer 262144 6000 &
numactl --membind 2 --cpubind 2 ./memhammer 262144 6000 &
numactl --membind 1 --cpubind 1 ./memhammer 262144 6000 &
numactl --membind 1 --cpubind 1 ./memhammer 262144 6000 &
numactl --membind 0 --cpubind 0 ./memhammer 262144 6000 &
numactl --membind 0 --cpubind 0 ./memhammer 262144 6000 &
./trigger
wait

$ sh ./membind_example.8
Total ops 1572864000 Avg nsec/op 95.8 gettimeofday usec 150674658 TPUT ops/sec 10438809.2
Total ops 1572864000 Avg nsec/op 96.5 gettimeofday usec 151843720 TPUT ops/sec 10358439.6
Total ops 1572864000 Avg nsec/op 96.9 gettimeofday usec 152368004 TPUT ops/sec 10322797.2
Total ops 1572864000 Avg nsec/op 96.9 gettimeofday usec 152433799 TPUT ops/sec 10318341.5
Total ops 1572864000 Avg nsec/op 96.9 gettimeofday usec 152436721 TPUT ops/sec 10318143.7
Total ops 1572864000 Avg nsec/op 97.0 gettimeofday usec 152635902 TPUT ops/sec 10304679.2
Total ops 1572864000 Avg nsec/op 97.2 gettimeofday usec 152819686 TPUT ops/sec 10292286.6
Total ops 1572864000 Avg nsec/op 97.6 gettimeofday usec 153494359 TPUT ops/sec 10247047.6

What’s that? Writing memory on the SUMA configuration in the 8 concurrent memhammer case demonstrated latencies on the order of 156ns, but these dropped 38% to 97ns by switching to NUMA and using the Linux 2.6 NUMA API. No, of course an Oracle workload is not all random writes, but a system has to be able to handle the difficult aspects of a workload in order to offer good throughput. I won’t ask the rhetorical question of why Oracle is not NUMA-aware in the x86_64 Linux ports until my next blog entry, where the measurements will not be based on the SLB, but on a real Oracle instance instead.

Déjà vu
Hold it. Didn’t the Dell PE1900 with quad-core Clovertown Xeon E5320s exhibit ~500ns latencies with only 4 concurrent threads of SLB execution (1 per core)? That was what was shown in this blog entry. Interesting.

I hope it is becoming clear why NUMA awareness is interesting. NUMA systems offer a great deal of potential incremental bandwidth when local memory is preferred over remote memory.

Next up—comparisons of SUMA versus NUMA with the Jonathan Lewis Computing Index and why all is not lost just because the 10gR2 x86_64 Linux port is not NUMA aware.

Oracle on Opteron with Linux-The NUMA Angle (Part IV). Some More About the Silly Little Benchmark.


In my recent blog post entitled Oracle on Opteron with Linux-The NUMA Angle (Part III). Introducing the Silly Little Benchmark, I made available the SLB and hoped to get some folks to measure some other systems using the kit. Well, I got my first results back from a fellow member of the OakTable Network—Christian Antognini of Trivadis AG. I finally got to meet him face to face back in November 2006 at UKOUG.

Christian was nice enough to run it on a brand new Dell PE1900 with, yes, a quad-core “Clovertown” processor of the low-end E5320 variety. As packaged, this Clovertown-based system has a 1066MHz front side bus and the memory is configured with 4x1GB 667MHz DIMMs. The processor was clocked at 1.86GHz.

Here is a snippet of /proc/cpuinfo from Christian’s system:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5320 @ 1.86GHz
stepping : 7
cpu MHz : 1862.560

I asked Christian to run the SLB (memhammer) with 1, 2 and 4 threads of execution and to limit the amount of memory per process to 512MB. He submitted the following:

[cha@helicon slb]$ cat example4.sh
./cpu_bind $$ 3
./create_sem
./memhammer 131072 6000 &
./trigger
wait

./cpu_bind $$ 3
./create_sem
./memhammer 131072 6000 &
./cpu_bind $$ 1
./memhammer 131072 6000 &
./trigger
wait

./cpu_bind $$ 3
./create_sem
./memhammer 131072 6000 &
./cpu_bind $$ 2
./memhammer 131072 6000 &
./cpu_bind $$ 1
./memhammer 131072 6000 &
./cpu_bind $$ 0
./memhammer 131072 6000 &
./trigger
wait
[cha@helicon slb]$ ./example4.sh
Total ops 786432000 Avg nsec/op 131.3 gettimeofday usec 103240338 TPUT ops/sec 7617487.7
Total ops 786432000 Avg nsec/op 250.4 gettimeofday usec 196953024 TPUT ops/sec 3992992.8
Total ops 786432000 Avg nsec/op 250.7 gettimeofday usec 197121780 TPUT ops/sec 3989574.4
Total ops 786432000 Avg nsec/op 503.6 gettimeofday usec 396010106 TPUT ops/sec 1985888.7
Total ops 786432000 Avg nsec/op 503.6 gettimeofday usec 396023560 TPUT ops/sec 1985821.2
Total ops 786432000 Avg nsec/op 504.6 gettimeofday usec 396854086 TPUT ops/sec 1981665.4
Total ops 786432000 Avg nsec/op 505.1 gettimeofday usec 397221522 TPUT ops/sec 1979832.3

So, while this is not a flagship Clovertown Xeon (e.g., E5355), the latencies are pretty scary. Contrast these results with the DL585 Opteron 850 numbers I shared in this blog entry. The Opteron 850 delivers 69ns with 2 concurrent threads of execution—some 47% quicker than this system exhibits with only 1 memhammer process running. And the direct comparison of 2 concurrent memhammer processes shows the Clovertown an astounding 3.6x slower than the Opteron 850 box. Here we see the true benefit of an on-die memory controller and the fact that HyperTransport is a true 1GHz path to memory. With 4 concurrent memhammer processes, the E5320 bogged down to 500ns! I’ll blog soon about what I see with the SLB on 4 sockets with my DL585 in the continuing NUMA series.

Other Numbers
I’d sure like to get numbers from others. How about Linux Itanium? How about a Power system with AIX? How about some SPARC numbers? Anyone have a Socket F Opteron box they could collect SLB numbers on? If so, get the kit and run the same scripts Christian did on his Dell PE1900. Thanks, Christian.

A Note About The SLB
Be sure to limit the memory allocation such that it does not cause major page faults or, eek, swapping. The first argument to memhammer is the number of 4KB pages to allocate, as the worked example below shows.
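For example, the 512MB-per-process limit used in Christian’s runs works out like this:

$ expr 512 \* 1024 \* 1024 / 4096
131072
$ ./memhammer 131072 6000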

Oracle on Opteron with Linux-The NUMA Angle (Part III). Introducing The Silly Little Benchmark.

In my blog “mini-series” about Oracle on Opteron NUMA, I am about to start covering the Linux 2.6 NUMA API and what it means to Oracle. I will share a lot of statspack information for certain, but first we need to go with micro-benchmark tests. The best micro-benchmark test for analysis of memory latency is one that uses the fewest processor cycles to write memory that is most likely not in processor cache. That is, spend the fewest cycles to put the most load on the memory subsystem. To that end, I’d like to make available an SLB—a Silly Little Benchmark.

Introducing the Silly Little Benchmark
I took some old code of mine as the framework, stripped out all but the most simplistic code to make the points of memory latency and locality clear. Now, I’m not suggesting this SLB mimics Oracle at all. There are 3 significantly contentious aspects missing from this SLB that could bring it closer to what Oracle does to a system:

  1. Shared Memory
  2. Mutual Exclusion (e.g., spinlocks in shared memory)
  3. I/O

It just so happens that the larger code I ripped this SLB from does possess these three characteristics; however, I want to get the simplest form of it out there first. As this NUMA series progresses I’ll make other pieces available. Also, this version of the SLB is quite portable—it should work on about any modern Unix variant.

Where to Download the SLB
The SLB kit is stored in a tar archive here (slb.tar).

Description of the SLB
It is supposed to be very simple and it is. It consists of four parts:

  • create_sem: as simple as it sounds. It creates a single IPC semaphore in advance of memhammer.
  • memhammer: the Silly Little Benchmark driver. It takes two arguments (without options). The first argument is the number of 4KB pages to allocate and the second is how many loop iterations to perform.
  • trigger: all memhammer processes wait on the semaphore created by create_sem; trigger operates the semaphore to start a run.
  • cpu_bind: binds a process to a CPU.

The first action for running the SLB is to execute create_sem. Next, fire off any number of memhammer processes, up to the number of processors on the system. It makes no sense running more memhammer processes than processors in the machine. Each memhammer will use malloc(3) to allocate some heap, initialize it all with memset(3), and then wait on the semaphore created by create_sem. Next, execute trigger and the SLB will commence its work loop, which runs through pseudo-random 4KB offsets in the malloc’ed memory and writes an 8-byte location within the first 64 bytes of each page. Why 64 bytes? All the 64bit systems I know of manage physical memory using a 64-byte cache line. As long as we write on any location residing entirely within a 64-byte line, we have caused as much work for the memory subsystem as we would if we scribbled on each of the eight 8-byte words the line can hold. Not scribbling over the entire line relieves us of the CPU overhead and allows us to put more duress on the memory subsystem—and that is the goal. The SLB has a very small measured unit of work, but it causes maximum memory stalls. Well, not maximum, as that would require spinlock contention, but this is good enough for this point of the NUMA mini-series.

Measuring Time
In prior lives, all of this sort of low-level measuring that I’ve done was performed with x86 assembly that reads the processor cycle counter—RDTSC. However, I’ve found it to be very inconsistent on multi-core AMD processors no matter how much fiddling I do to serialize the processor (e.g., with the CPUID instruction). It could just be me, I don’t know. It turns out that it is difficult to stop predictive reading of the TSC, and I don’t have time to fiddle with it on a pre-Socket F Opteron. When I finally get my hands on a Socket F Opteron system, I’ll change my measurement technique to RDTSCP, which serializes the processor and reads the time stamp counter in one instruction. Until then, I think performing millions upon millions of operations and dividing by microsecond-resolution gettimeofday(2) timings should be about sufficient. Any trip through the work loop that gets nailed by hardware interrupts will unfortunately increase the average, but running the SLB on an otherwise idle system should be a pretty clean test.

Example Measurements
Getting ready for the SLB is quite simple. Simply extract and compile:

$ ls -l slb.tar
-rw-r--r-- 1 root root 20480 Jan 26 10:12 slb.tar
$ tar xvf slb.tar
cpu_bind.c
create_sem.c
Makefile
memhammer.c
trigger.c
$ make

cc -c -o memhammer.o memhammer.c
cc -O -o memhammer memhammer.o
cc -c -o trigger.o trigger.c
cc -O -o trigger trigger.o
cc -c -o create_sem.o create_sem.c
cc -O -o create_sem create_sem.o
cc -c -o cpu_bind.o cpu_bind.c
cc -O -o cpu_bind cpu_bind.o

Some Quick Measurements
I used my DL585 with 4 dual-core Opteron 850s to test a single invocation of memhammer and then compared it to 2 invocations on the same socket. The first test bound execution to processor number 7, which executed the test in 108.28 seconds with an average write latency of 68.8ns. The next test executed 2 invocations, both on the same physical CPU. This caused the result to be a bit “lumpy.” The average of the two was 70.2ns—about 2% more than the single invocation on the same processor. For what it is worth, there was 2.4% average latency variation between the two concurrent invocations:

$ cat example1
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./trigger
wait

./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 6
./memhammer 262144 6000 &
./trigger
wait

$ sh ./example1

Total ops 1572864000 Avg nsec/op 68.8 gettimeofday usec 108281130 TPUT ops/sec 14525744.2

Total ops 1572864000 Avg nsec/op 69.3 gettimeofday usec 108994268 TPUT ops/sec 14430703.8

Total ops 1572864000 Avg nsec/op 71.0 gettimeofday usec 111633529 TPUT ops/sec 14089530.4

The next test was to dedicate an entire socket to each of two concurrent invocations, which really smoothed out the result. Executing this way resulted in less than 1/10th of 1 percent variance between the average write latencies:

$ cat example2

./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./trigger
wait

$ sh ./example2

Total ops 1572864000 Avg nsec/op 69.1 gettimeofday usec 108640712 TPUT ops/sec 14477666.5

Total ops 1572864000 Avg nsec/op 69.2 gettimeofday usec 108871507 TPUT ops/sec 14446975.6

Cool And Quiet
Up to this point, the testing was done with the DL585s executing at 2200MHz. Since I have my DL585s set for dynamic power consumption adjustment, I can simply blast a SIGUSR2 at the cpuspeed processes. The cpuspeed processes catch the SIGUSR2 and adjust the P-state of the processor to the lowest power consumption—features supported by AMD Cool And Quiet Technology. The following shows how to determine the clock-speed bounds the processor is fit to execute at. In my case, the processor will range from 2200MHz down to 1800MHz. Note, I recommend fixing the clock speed with SIGUSR1 or SIGUSR2 before any performance testing; you might otherwise grow tired of inconsistent results. Note, there is no man page for the cpuspeed executable. You have to execute it to get a help message with command usage.

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
2200000 2000000 1800000

# chkconfig --list cpuspeed
cpuspeed 0:off 1:on 2:on 3:on 4:on 5:on 6:on

# ps -ef | grep cpuspeed | grep -v grep
root 1796 1 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1797 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1798 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1799 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1800 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1801 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1802 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1803 1796 0 11:23 ? 00:00:00 cpuspeed -d -n

# cpuspeed -h 2>&1 | tail -12

To have a CPU stay at the highest clock speed to maximize performance send the process controlling that CPU the SIGUSR1 signal.

To have a CPU stay at the lowest clock speed to maximize battery life send the process controlling that CPU the SIGUSR2 signal.

To resume having a CPU’s clock speed dynamically scaled send the process controlling that CPU the SIGHUP signal.
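Putting the signals to work is a one-liner each. A sketch, assuming pkill is available to signal every cpuspeed process by name. To pin every CPU at its lowest clock:

# pkill -USR2 cpuspeed

And to resume dynamic scaling afterward:

# pkill -HUP cpuspeed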

So, now that the clock speed has been adjusted down 18% from 2200MHz to 1800MHz, let’s see what example1 does:

$ sh ./example1
Total ops 1572864000 Avg nsec/op 81.6 gettimeofday usec 128352437 TPUT ops/sec 12254259.0
Total ops 1572864000 Avg nsec/op 77.0 gettimeofday usec 121043918 TPUT ops/sec 12994159.7
Total ops 1572864000 Avg nsec/op 82.8 gettimeofday usec 130198851 TPUT ops/sec 12080475.3

The slower clock rate brought the single-invocation number up to an 81.6ns average—up 18%.

With the next blog entry, I’ll start to use the SLB to point out NUMA characteristics of 4-way Opteron servers. After that it will be time for some real Oracle numbers. Please stay tuned.

Non-Linux Platforms
My old buddy Glenn Fawcett of Sun’s Strategic Applications Engineering Group has collected data from both SPARC and Opteron-based Sun servers. He said to use the following compiler options:

  • -xarch=v9 for SPARC
  • -xarch=amd64 for Opteron

He and I had Martinis the other night, but I forgot to ask him to run it on idle systems for real measurement purposes. Having said that, I’d really love to see numbers from anyone who cares to run this. It would be great to have the output of /proc/cpuinfo and the server make/model. In fact, if someone can run this on a Socket F Opteron system I’d greatly appreciate it. It seems our Socket F systems are not slated to arrive here for a few more weeks.


