This is part 6 in a series about Oracle on Opteron-based NUMA servers running Linux. The list of prior installments can be found through my index of NUMA-related posts.
In part 5 of the series I discussed using Opteron-based servers with NUMA features disabled in the BIOS. Running an Opteron server (e.g., HP ProLiant DL585) in this fashion is sometimes called SUMA (Sufficiently Uniform Memory Access) or SUMO (Sufficiently Uniform Memory Organization). At the risk of being controversial, I pointed out that the Oracle Validated Configuration listing for the ProLiant recommends configuring Opteron-based servers as SUMA/SUMO. In my experience, most folks do not change the BIOS and are therefore running a NUMA system, since that is the default. However, if steps are taken to disable NUMA on an Opteron system, there are subtleties that warrant deeper understanding. How subtle are the subtleties? That question is the main theme of this blog series.
Memory Latencies with SUMA/SUMO vs NUMA
In part 5 of the series, I used the SLB memory latency workload to show how memory writes differ in NUMA versus SUMA/SUMO. I wrote:
Writing memory on the SUMA configuration in the 8-concurrent-memhammer case demonstrated latencies on the order of 156ns, but that dropped 38% to 97ns by switching to NUMA and using the Linux 2.6 NUMA API.
But What About Oracle?
What is the cost of running Oracle on SUMA? The simple answer is, it depends. More architectural background is needed before I go into that.
SUMA, NUMA and CYCLOPS
OK, so SUMA is what you get when you tweak a ProLiant Opteron-based server so that memory is interleaved at the low level (node interleaving in the BIOS). Accompany that with numa=off on the kernel boot line in grub.conf and you get a completely non-NUMA setup.
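For illustration only, here is the sort of boot entry I mean; the kernel version and root device below are placeholders, not taken from my DL585:

    # /boot/grub/grub.conf -- illustrative entry; kernel version and
    # root device are placeholders
    title Linux 2.6 (SUMA boot)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-smp ro root=/dev/sda2 numa=off
        initrd /initrd-2.6.9-smp.img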
Cyclops
NUMA enabled in the BIOS, however, is the default. If the Oracle ports to Linux were NUMA-aware, that would be just fine. However, if the server isn’t configured as a SUMA and you boot Oracle without any consideration for the fact that you are on a NUMA system, you get what I call Cyclops. Let’s take a look at what I mean.
In the following screen shot I have booted an Oracle10g SGA of 7584MB on my ProLiant DL585. The system is configured with 32GB physical memory, which is, of course, 4 banks of 8GB, each attached to one of the 4 dual-core Opterons (nodes). Before booting this SGA, I had between roughly 7.6GB and 7.7GB free memory on each of the memory banks. In the following figure it is clear that after booting this 7584MB SGA, all but 116MB of node 0 (socket 0) has been consumed: Cyclops!
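If you want to see this on your own system, numactl(8) reports the per-node picture, and 2.6 kernels also expose per-node meminfo through sysfs. For example:

    # Show the node topology and how much memory is free on each bank
    numactl --hardware

    # Or inspect a single bank directly
    cat /sys/devices/system/node/node0/meminfo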
Right, so really bad things can happen if processes that are memory-resident on node 0 try to allocate more memory. In the 2.4 kernel timeframe, Red Hat documented such ill effects as OOM process termination on this web page. I haven’t spent much time researching how the 2.6 kernel responds, because the point of this blog entry is to stay out of such a situation in the first place.
Let’s consider what things are like on a Cyclops even if there are no process or memory allocation failures. Let’s say, for instance, there is a listener with soft node affinity to node 2. All the sessions it forks off will have node affinity to node 2, where they will be granted pages for their kernel structures, page tables, stack, heap and so on. However, the entire SGA is remote memory, since, as you can see, all of it was allocated from node 0. That is, um, not good.
Hugepages Are More Attractive Than Cyclops
Cyclops pops up its ugly single-eyed head only when you are running NUMA (not SUMA/SUMO) and fail to allocate/use hugepages. Whether you allocate hugepages off the grub boot line or out of sysctl.conf, memory for hugepages is allocated in a distributed fashion from the various memory banks. Did I say round-robin? No. Because I don’t yet know whether it is round-robin or segmented. I have to leave something to blog about in the future.
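Both methods boil down to a single line. A sketch of each (the 3800 figure matches the example that follows):

    # /etc/sysctl.conf -- reserve 3800 x 2MB hugepages
    vm.nr_hugepages = 3800

    # ...or, for allocation at boot, append to the kernel line in grub.conf:
    #   hugepages=3800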
The following is a screen shot of a session where, after the system was booted, I allocated 3800 2MB hugepages by echoing that value into /proc/sys/vm/nr_hugepages. Notice that unlike Cyclops, the pages are allocated for Oracle’s future use in a distributed fashion from the various memory banks. I then booted Oracle. No Cyclops here.
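In case you want to reproduce that, the runtime method is just an echo, and on kernels that report per-node hugepage counts you can check how the pages landed:

    # Reserve 3800 x 2MB hugepages at runtime (before booting Oracle)
    echo 3800 > /proc/sys/vm/nr_hugepages

    # Verify the system-wide reservation
    grep Huge /proc/meminfo

    # On kernels that export per-node hugepage counts, see how the
    # pages were spread across the memory banks
    grep -i huge /sys/devices/system/node/node*/meminfo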
Interleaving NUMA Memory Allocation
The numactl(8) command supports pushing memory allocation preferences down to its children. Until such time as the Linux port of Oracle is NUMA-aware internally (as was done in the Sequent DYNIX/ptx, SGI, and DG ports, and to a lesser degree the Solaris Oracle10g port with MPO), the best hope for efficient memory usage on a commodity NUMA system is to interleave the placement of shared memory via numactl(8). With the SGA allocated in this fashion on a 4-socket NUMA system, Oracle’s memory accesses for the variable and buffer pool components will have locality of up to 25%, generally speaking. Yes, I’m sure some session could go crazy with logical reads of 2 buffers 20,000 times per second or some such pathological situation, but I am trying to cover the topic in general terms. You might wonder how this differs from SUMA/SUMO, though.
With SUMA, all memory is interleaved. That means even the NUMA-aware Linux 2.6 kernel cannot exploit the hardware architecture by allocating its structures with respect to the memory hierarchy. That is a pure waste. Moreover, with SUMA, 100% of your Oracle memory accesses hit interleaved memory, and that includes the PGA. In contrast, properly allocated NUMA-interleaved hugepages result in fairness in SGA placement, while the PGA (heap) and stack allocations for the sessions are 100% local memory! That is a good thing. In the following screen shot I coupled numactl(8) memory interleaving with hugepages.
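The mechanics are simple: make numactl(8) the parent of the process that creates the SGA, so the interleave policy is inherited by the shared memory allocation. A minimal sketch, assuming a typical manual startup (your startup scripts may differ):

    # Start the instance under an interleaved memory policy so the SGA
    # is spread evenly across the nodes. Sessions arriving through the
    # listener are not children of numactl, so their PGA and stack
    # allocations stay node-local.
    numactl --interleave=all sqlplus /nolog <<EOF
    connect / as sysdba
    startup
    EOF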
Validated Oracle Configuration
As I pointed out, this Oracle Validated Configuration listing for the ProLiant recommends turning off NUMA. Now that I’m an HP employee, I’ll have to pursue that a bit, because I don’t agree with it at all. You’ll see why when I post my performance measurements contrasting NUMA (with interleaved hugepages) to SUMA/SUMO. Look at that Validated Configuration web page closely and you’ll see a recommendation to allow Oracle to use hugepages by tuning /etc/security/limits.conf, but no allocation of hugepages at all, whether from the grub boot line or via the sysctl.conf file!
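For reference, the limits.conf piece of that recommendation comes down to a pair of memlock entries; the value is in KB and must be large enough to cover the hugepage-backed SGA. A sketch (the figure below is illustrative, sized to cover the 7584MB SGA used in this post):

    # /etc/security/limits.conf -- let the oracle user lock enough
    # memory for a hugepage-backed SGA (values in KB; illustrative)
    oracle soft memlock 8388608
    oracle hard memlock 8388608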
Could it be that the recommendations in this Validated Configuration were a knee-jerk reaction to Cyclops? I’m not much of a betting man, but I’d wager $5.00 that was the case. Like I said, I’m in HP now…I’ll have to see what all that was about.
Up Next
In my next installment, I will provide Oracle measurements contrasting SUMA and NUMA. I know I said this would be the installment with Oracle performance numbers, but I had too much groundwork to lay in this post. The mind can only absorb what the seat can endure.
Patent Infringement
For all you folks who hate the concept of software patents, here’s a good one. When my Sequent colleagues and I were working out the OS requirements to support our NUMA optimizations of the Oracle 8 port to Sequent’s NUMA-Q system, we knew early on we’d need a very rich set of enhancements to shmget() for memory region placement. So we specified the requirements to our OS developers. Lo and behold, U.S. Patent 6,505,286 plopped out. So, for extra credit, can someone explain to me how the Linux 2.6 libnuma call numa_alloc_onnode() (described here) is not in complete violation of that patent? Hmmm…
Now for a real taste of NUMA-Oracle history, read the following: Sequent_NUMA_Oracle8i
Good series of articles. Thanks. Looking forward to the next one. Any chance of adding RAC to the picture? I like the way in which you start up a single database, but don’t see how I might apply that to a RAC environment running multiple databases.
As to your concern about Oracle/HP recommending turning NUMA off: the page you refer to (http://www.oracle.com/technology/tech/linux/validated-configurations/html/vc_dl585_4node_fas9xx.html?rssid=valconfig) simply shows a “*validated* configuration”. It’s not a recommendation.
Kind regards,
Herta
We have 3 DL585 machines, each with four 2.4GHz Opterons and 32GB RAM. We had no idea of the recommendation to turn off the NUMA setting at the BIOS level and have therefore been running in standard NUMA mode. Apparently using hugepages has saved us. The machine most heavily burdened is running along quite happily.
Issues: one database instance on a separate machine will simply not use the hugepages. Another was using them and stopped doing so after a reboot. We are not sure why. During research, we stumbled here (luckily). We have been looking at the “validated configuration” on Oracle’s site and will not disable NUMA after reading your blog. Thanks!
Forgot to add: we’re running Oracle 10.2.0.2 on SUSE Linux 9 SP2.
This is an excellent and very informative thread of blog, Kevin. Thanks!
I’m curious, have you done any testing or have any insights into vm.numa_memory_allocator? There is little info available on this tunable that I’ve found so far, and before I start sifting through kernel source, I wanted to throw this out there. This was added to Update 7 of RHEL 3 (yes, I know, a bit dated at this point), but since Oracle recommends disabling NUMA on DL 585 gear, and I’m using that with RHEL 3 U8+, I’m curious about no NUMA vs. leaving it enabled in the BIOS and using this tunable. I’m wondering if it helps avoid the cyclops scenario…
Thanks,
-charles.
Hi Charles,
It turns out that mention of a particular setting in a Validated Configuration doesn’t imply a support requirement. That aside, I really can’t comment on anything NUMA related in RHEL3 since I abandoned that long before I even got that DL585…which I no longer have since it is in an HP lab and I’ve moved on to Oracle…
Patents are always fun to argue. I’m neither totally for nor against them. If someone patents a new hardware technology and 50% of software engineers think of the same obvious ways to use it, then I don’t think the first person should get a patent.
One could say that CPU registers are a form of non-uniform memory access. They are certainly fast. Hence, would the ‘C’ specifier ‘register’ be in violation of the patent because it is a local allocation of “memory”? 🙂
The gall of the Linux folks actually wanting to allow the use of “local faster memory” as local faster memory! Somebody patents a hammer and then patents pounding things with it.
Has NUMA itself been patented? Or do all implementations pay a royalty or take a license?
>>The gall of the Linux folks actually wanting to allow the use of “local faster memory” as local faster memory!
There is nothing wrong with wanting to do any particular optimization. The question is about the implementation method. I’m not here to argue whether any of my patents or patents of former Sequent colleagues were sufficiently novel to warrant a patent because it’s a bit too late for that. It is my opinion, however, that if you are the first to do something novel, useful and clear of prior art you’ve probably got something worthy of a patent. There needs to be incentive for being the first.
>>If someone patents a new hardware technology and 50% of software engineers think of the same obvious ways to use it then I don’t think the first person should get a patent.
…and neither do I. A patent needs to cover something non-obvious. That is one of the criteria. Everything is obvious with 20-20 hindsight.
>>Somebody patents a hammer and then patents pounding things with it.
…The hammer and pounding analogy comes up all the time. I tend to look at it more like this: someone invents the hammer, then someone later invents an improvement that, in a single embodiment, makes it 1) physically impossible for the user to hit his thumb, 2) physically impossible to bend a nail, 3) physically impossible to drive the nail in crooked, and 4) amplifies kinetic energy 3-fold during usage.
NUMA itself is not patentable. If someone invented a NUMA system that automagically smoothed out all topology side effects (latency, unfairness, and so on), then I think that would be worthy of a patent, and it would make this specific patent conversation moot 🙂