Archive for the 'Oracle NUMA' Category

Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn’t It? Part – I.

Allocating hugepages for Oracle Database on Linux can be tricky. The following is a short list of some of the common problems associated with faulty attempts to get things properly configured:

  1. Insufficient Hugepages. You can be short just a single 2MB hugepage at instance startup and Oracle will silently fall back to non-hugepages shared memory. For instance, if an instance needs 10,000 hugepages but only 9,999 are available at startup, Oracle will create non-hugepages IPC shared memory and the 9,999 (x 2MB) is just wasted memory.
    1. Insufficient hugepages is an even more difficult situation when booting with _enable_NUMA_support=TRUE, as partial hugepages backing is possible.
  2. Improper Permissions. Both the limits.conf(5) memlock setting and the shell ulimit -l must accommodate the desired amount of locked memory (see the sketch just after this list).
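
A minimal sketch of list item 2, with illustrative values only (memlock is expressed in kilobytes and must be at least as large as the hugepages-backed SGA):

# /etc/security/limits.conf (values in KB; these numbers are placeholders)
oracle   soft   memlock   25165824
oracle   hard   memlock   25165824

# verify from the oracle user's shell before starting the instance
$ ulimit -l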

In general, list item 1 above has historically been the most difficult to deal with—especially on systems hosting several instances of Oracle. Since there is no way to determine whether an existing segment of shared memory is backed with hugepages, diagnostics are in short supply.

Oracle Database 11g Release 2 (11.2.0.2)
The fix for Oracle bugs 9195408 (unpublished) and 9931916 (published) is available in 11.2.0.2. In a sort of fast forward to the past, the Linux port now supports an initialization parameter to force the instance to use hugepages for all segments or fail to boot. I recall initialization parameters on Unix ports back in the early 1990s that did just that. The initialization parameter is called use_large_pages and setting it to “only” results in the all-or-none scenario. This, by the way, addresses list item 1.1 above. That is, setting use_large_pages=only ensures an instance will not have some NUMA segments backed with hugepages and others without.

Consider the following example. Here we see that use_large_pages is set to “only” and yet the system has only a very small number of hugepages allocated (800 == ~1.6GB). First I’ll boot the instance using an init.ora file that does not force hugepages and then move on to using the one that does. Note, this is 11.2.0.2.

$ sqlplus '/ as sysdba'

SQL*Plus: Release 11.2.0.2.0 Production on Tue Sep 28 08:10:36 2010

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to an idle instance.

SQL>
SQL> !grep -i huge /proc/meminfo
HugePages_Total:   800
HugePages_Free:    800
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
SQL>
SQL> !grep large_pages y.ora x.ora
use_large_pages=only
SQL>
SQL> startup force pfile=./x.ora
ORACLE instance started.

Total System Global Area 4.4363E+10 bytes
Fixed Size                  2242440 bytes
Variable Size            1406199928 bytes
Database Buffers         4.2950E+10 bytes
Redo Buffers                4427776 bytes
Database mounted.
Database opened.
SQL> HOST date
Tue Sep 28 08:13:23 PDT 2010

SQL>  startup force pfile=./y.ora
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory

The user feedback is a trite ORA-27102. So the question is, which memory cannot be allocated? Let’s take a look at the alert log:

Tue Sep 28 08:16:05 2010
Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 800 free: 800)
DFLT Huge Pages allocation successful (allocated: 512)
Huge Pages allocation failed (free: 288 required: 10432)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (1) (allocated: 3)
Huge Pages allocation failed (free: 285 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 285 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (1) (allocated: 192)
NUMA Huge Pages allocation on node (1) (allocated: 64)

That is good diagnostic information. It informs us that the variable portion of the SGA was successfully allocated and backed with hugepages. It just so happens that my variable SGA component is precisely sized to 1GB. That much is simple to understand. After creating the segment for the variable SGA component Oracle moves on to create the NUMA buffer pool segments. This is a 2-socket Nehalem EP system and Oracle allocates from the Nth NUMA node and works back to node 0. In this case the first buffer pool creation attempt is for node 1 (socket 1). However, there were insufficient hugepages as indicated in the alert log. In the following example I allocated another arbitrarily insufficient number of hugepages and tried to start an instance with use_large_pages=only. This particular insufficient hugepages scenario allows us to see more interesting diagnostics:

SQL>  !grep -i huge /proc/meminfo
HugePages_Total: 12000
HugePages_Free:  12000
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

SQL> startup force pfile=./y.ora
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory

…and, the alert log:

Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 12000 free: 12000)
DFLT Huge Pages allocation successful (allocated: 512)
NUMA Huge Pages allocation on node (1) (allocated: 10432)
Huge Pages allocation failed (free: 1056 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 1056 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 1056 required: 5184)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (0) (allocated: 704)
NUMA Huge Pages allocation on node (0) (allocated: 320)

In this example we see that 12,000 hugepages were sufficient to back the variable SGA component and only one of the NUMA buffer pools (remember, this is Nehalem EP with OS boot string numa=on).
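
As an aside, a rough way to size the hugepages pool for instances that are already running is to sum the SysV shared memory segments owned by oracle and divide by the hugepage size. The following is only a sketch of that idea (Oracle publishes a supported script that does this more carefully); it assumes every oracle-owned segment should be hugepage-backed:

#!/bin/bash
# Sketch: estimate vm.nr_hugepages for the Oracle SGAs currently running.
HPG_KB=$(awk '/Hugepagesize/ {print $2}' /proc/meminfo)   # hugepage size in kB
ipcs -m | awk -v hpg_kb="$HPG_KB" '
  $3 == "oracle" { bytes += $5 }                          # sum oracle-owned segment sizes
  END { printf "suggested vm.nr_hugepages: %d\n", bytes / (hpg_kb * 1024) + 1 }'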

Summary

In my opinion, this is a must-set parameter if you need hugepages. With initialization parameters like use_large_pages, configuring hugepages for Oracle Database is getting a lot simpler.

Next In Series

  1. “[…] if you need hugepages”
  2. More on hugepages and NUMA
  3. Any pitfalls I find.

More Hugepages Articles

Link to Part II in this series: Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn’t It? Part – II.
Link to Part III in this series: Configuring Linux Hugepages for Oracle Database is Just Too Difficult! Isn’t It? Part – III.
And more:
  Quantifying hugepages Memory Savings with Oracle Database 11g
  Little Things Doth Crabby Make – Part X. Posts About Linux Hugepages Makes Some Crabby It Seems. Also, Words About Sizing Hugepages.
  Little Things Doth Crabby Make – Part IX. Sometimes You Have To Really, Really Want Your Hugepages Support For Oracle Database 11g.
  Little Things Doth Crabby Make – Part VIII. Hugepage Support for Oracle Database 11g Sometimes Means Using The ipcrm Command. Ugh.
  Oracle Database 11g Automatic Memory Management – Part I.

You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part III.

By The Way, How Many NUMA Nodes Is Your AMD Opteron 6100-Based Server?

In my ongoing series about Oracle Database 11g configuration for NUMA systems I’ve spoken of the enabling parameter and how it changed from _enable_NUMA_optimization (11.1) to _enable_NUMA_support (11.2). For convenience’s sake I’ll point to the other two posts in the series for folks that care to catch up.

What does AMD Opteron 6100 (Magny-Cours) have to do with my on-going series on enabling/disabling NUMA features in Oracle Database? That’s a good question. However, wouldn’t it be premature to just presume each of these 12-core processors is a NUMA node?

The AMD Opteron 6100 is a Multi-Chip Module (MCM). The “package” is two hex-core processors essentially “glued” together and placed into a socket. Each die has its own memory controller (hint, hint). I wonder what the Operating System sees in the case of a 4-socket server? Let’s take a peek.

The following is output from the numactl(8) command on a 4s48c Opteron 6100 (G34)-based server:

# numactl --hardware
available: 8 nodes (0-7)
node 0 size: 8060 MB
node 0 free: 7152 MB
node 1 size: 16160 MB
node 1 free: 16007 MB
node 2 size: 8080 MB
node 2 free: 8052 MB
node 3 size: 16160 MB
node 3 free: 15512 MB
node 4 size: 8080 MB
node 4 free: 8063 MB
node 5 size: 16160 MB
node 5 free: 15974 MB
node 6 size: 8080 MB
node 6 free: 8051 MB
node 7 size: 16160 MB
node 7 free: 15519 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  16  22  22  16
  2:  16  22  10  16  16  16  16  16
  3:  22  16  16  10  16  16  22  22
  4:  16  16  16  16  10  16  16  22
  5:  22  22  16  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  16  22  22  16  16  10

Heft
It wasn’t that long ago that an 8-node NUMA system was so large that a forklift was necessary to move it about (think Sequent, SGI, DG, DEC, etc.). Even much more recent 8-socket (thus 8 NUMA nodes) servers were a 2-man lift and quite large (e.g., 7U HP Proliant DL785). These days, however, an 8-node NUMA system like the AMD Opteron 6100 (G34) comes in a 2U package!

Is it time yet to stop thinking that NUMA is niche technology?

I’ll blog soon about booting Oracle to test NUMA optimizations on these 8-node servers.

You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part I.

In May 2009 I made a blog entry entitled You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part II. There was not yet a Part I, but as I pointed out in that post, I intended to loop back and write it. Here it is. Better late than never.

Background
I originally planned to use Part I to stroll down memory lane (back to 1995) with a story about the then VP of Oracle RDBMS Development’s initial impression of the Sequent DYNIX/ptx NUMA API during a session where we presented it and argued that coding to NUMA APIs sooner rather than later would be beneficial. To be honest, we were mixing vision with the specific needs of our port.

We were the first to have a production NUMA API to which Oracle could port, and we were quite a bit earlier to the whole NUMA trend than anyone else. Ours was the first production NUMA system.

Now, this VP is no longer at Oracle but the (redacted) response was, “Why would we want to use any of this ^#$%.” We (me and the three others presenting the API) were caught off guard. However, we all knew that the question was a really good one. There were still good companies making really tight, high-end SMPs with uniform memory. Just because we (Sequent) had to move into NUMA architecture didn’t mean we were blind to the reality around us. However, one thing we knew for sure—all systems in the future would have NUMA attributes of varying degrees. All our competition was either in varying stages of denial or doing what I like to refer to as “Poo-pooh it while you do it.” All the major players eventually came out with NUMA systems. Some sooner, some later, and others died trying.

That takes us to Commodity NUMA and the new purpose of this “Part I” post.

Before I say a word about this Part I, I’d like to point out that the concepts in Part II are of a “must-know” variety unless you relinquish your computing power to some sort of hosted facility where you don’t have the luxury of caring about the architecture upon which you run Oracle Database.

Part II was about the different types of NUMA (historical and present) and such knowledge will help you if you find yourself in a troubling performance situation that relates to NUMA. NUMA is commodity, as I point out, and we have to come to grips with that.

What Is He Blogging About?
The current state of commodity NUMA is very peculiar. These Commodity NUMA Implementations (CNI) systems are so tightly coupled that most folks don’t even realize they are running on a NUMA system. In fact, let me go out on a limb. I assert that nobody is configuring Oracle Database 11g Release 2 with NUMA optimizations in spite of the fact that they are on a NUMA box (e.g., Nehalem EP, AMD Opteron). The reason I believe this is that the init.ora parameter to invoke Oracle NUMA awareness changed names from 11gR1 to 11gR2 as per My Oracle Support note 864633.1. The parameter changed from _enable_NUMA_optimization to _enable_NUMA_support. I know nobody is setting this because if they had, I can almost guarantee they would have googled for problems. Allow me to explain.

If Nobody is Googling It, Nobody is Doing It
Anyone who tests _enable_NUMA_support as per My Oracle Support note 864633.1 will likely experience the sorts of problems that I detail later in this post. But first, let’s see what they would get from google when they search for _enable_NUMA_support:

Yes, just as I thought…Google found nothing. But what is my point? My point is two-fold. First, I happen to know that Nehalem EP with QPI and Opteron with AMD HyperTransport are such good technologies that you really don’t have to care that much about NUMA software optimizations. At least at this point in the game. Reading M.O.S. note 1053332.1 (regarding disabling Linux NUMA support for Oracle Database Machine hosts) sort of drives that point home. However, saying you don’t need to care about NUMA doesn’t mean you shouldn’t experiment. How can anyone say that setting _enable_NUMA_support is a total placebo in all cases? One can’t prove a negative.

If you dare, trust me when I say that an understanding of NUMA will be as essential in the next 10 years as understanding SMP (parallelism and concurrency) was in the last 20 years. OK, off my soapbox.

Some Lessons in Enabling Oracle NUMA Optimizations with Oracle Database 11g Release 2
This section of the blog aims to point out that even when you think you might have tested Oracle NUMA optimizations there is a chance you didn’t. You have to know the way to ensure you have NUMA optimizations in play. Why? Well, if the configuration is not right for enabling NUMA features, Oracle Database will simply ignore you. Consider the following session where I demonstrate the following:

  1. Evidence that I am on a NUMA system (numactl(8))
  2. I started up an instance with a pfile (p4.ora) that has _enable_NUMA_support set to TRUE
  3. The instance started but _enable_NUMA_support was forced back to FALSE

Note: in spite of item 3 above, the alert log will not report anything to you about what went wrong.

SQL>
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 36317 MB
node 0 free: 31761 MB
node 1 size: 36360 MB
node 1 free: 35425 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

SQL> startup pfile=./p4.ora
ORACLE instance started.

Total System Global Area 5746786304 bytes
Fixed Size                  2213216 bytes
Variable Size            1207962272 bytes
Database Buffers         4294967296 bytes
Redo Buffers              241643520 bytes
Database mounted.
Database opened.
SQL> show parameter _enable_NUMA_support

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_enable_NUMA_support                 boolean     FALSE

SQL>
SQL> !grep _enable_NUMA_support ./p4.ora
_enable_NUMA_support=TRUE

OK, so the instance is up and the parameter was reverted. What does the IPC shared memory segment look like?

SQL> !ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0xed304ac0 229380     oracle    660        4096       0
0x7393f7f4 1179653    oracle    660        5773459456 35
0x00000000 393223     oracle    644        790528     5          dest
0x00000000 425992     oracle    644        790528     5          dest
0x00000000 458761     oracle    644        790528     5          dest

Right, so I have no NUMA placement of the buffer pool. On Linux, Oracle must create multiple segments and allocate them on specific NUMA nodes (memory hierarchies). It was a little simpler for the first NUMA-aware port of Oracle (Sequent) since the APIs allowed for the creation of a single shared memory segment with regions of the segment placed onto different memories. Ho Hum.

What Went Wrong
Oracle could not find the libnuma.so it wanted to link with dlopen():

$ grep libnuma /tmp/strace.out | grep ENOENT | head
14626 open("/usr/lib64/libnuma.so", O_RDONLY) = -1 ENOENT (No such file or directory)
14627 open("/usr/lib64/libnuma.so", O_RDONLY) = -1 ENOENT (No such file or directory)

So I create the necessary symbolic link and subsequently boot the instance and inspect the shared memory segments. Here I see that I have a ~1GB segment for the variable SGA components and my buffer pool has been segmented into two roughly 2.3 GB segments.
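
Creating the missing link is a one-liner; the path matches what strace showed Oracle looking for:

# ln -s /usr/lib64/libnuma.so.1 /usr/lib64/libnuma.so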

# ls -l /usr/*64*/*numa*
lrwxrwxrwx 1 root root    23 Mar 17 09:25 /usr/lib64/libnuma.so -> /usr/lib64/libnuma.so.1
-rwxr-xr-x 1 root root 21752 Jul  7  2009 /usr/lib64/libnuma.so.1

SQL> show parameter db_cache_size

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_cache_size                        big integer 4G
SQL> show parameter NUMA_support

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_enable_NUMA_support                 boolean     TRUE
SQL> !ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0xed304ac0 229380     oracle    660        4096       0
0x00000000 2719749    oracle    660        1006632960 35
0x00000000 2752518    oracle    660        2483027968 35
0x00000000 393223     oracle    644        790528     6          dest
0x00000000 425992     oracle    644        790528     6          dest
0x00000000 458761     oracle    644        790528     6          dest
0x00000000 2785290    oracle    660        2281701376 35
0x7393f7f4 2818059    oracle    660        2097152    35

So there I have an SGA successfully created with _enable_NUMA_support set to TRUE. But, what strings appear in the alert log? Well, I’ll blog that soon because it leads me to other content.

Oracle11g Automatic Memory Management – Part III. A NUMA Issue.

Now I’m glad I did that series about Oracle on Linux, The NUMA Angle. In my post about the difference between NUMA and SUMA and “Cyclops”, I shared a lot of information about the dynamics of Oracle running with all the SGA allocated from one memory bank on a NUMA system. Déjà vu.

Well, we’re at it again. As I point out in Part I and Part II of this series, Oracle implements Automatic Memory Management in Oracle Database 11g with memory mapped files in /dev/shm. That got me curious.
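
If you want to peek at those memory mapped files on your own system, they live under /dev/shm; the ora_ naming shown here is what I have observed and should be treated as an assumption:

$ ls -l /dev/shm | grep ora_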

Since I exclusively install my Oracle bits on NFS mounts, I thought I’d sling my 11g ORACLE_HOME over to a DL385 I have available in my lab setup. Oh boy am I going to miss that lab when I take on my new job September 4th. Sob, sob. See, when you install Oracle on NFS mounts, the installation is portable. I install 32bit Linux ports via a 32bit server into an NFS mount and I can take it anywhere. In fact, since the database is on an NFS mount (HP EFS Clustered Gateway NAS) I can take the ORACLE_HOME and the database mounts to any system running a RHEL4 OS, and that includes RHEL4 x86_64 servers even though the ORACLE_HOME is 32bit. That works fine, except 32bit Oracle cannot use libaio on 64bit RHEL4 (unless you invoke everything under the linux32 command environment, that is). I don’t care about that since I use either Oracle Disk Manager or, better yet, Oracle11g Direct NFS. Note, running 32bit Oracle on a 64bit Linux OS is not supported for production, but for my case it helps me check certain things out.

That brings us back to /dev/shm on AMD Opteron (NUMA) systems. It turns out the only Opteron system I could test 11g AMM on happens to have x86_64 RHEL4 installed, but, again, no matter.

Quick Test

[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 3955 MB
[root@tmr6s5 ~]# dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024
1024+0 records in
1024+0 records out
[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 2927 MB

Uh, that’s not good. I dumped some zeros into a file on /dev/shm and all the memory was allocated from socket 1. Lest anyone forget from my NUMA series (you did read that didn’t you?), writing memory not connected to your processor is, uh, slower:

[root@tmr6s5 ~]# taskset -pc 0-1 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 0,1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m1.116s
user    0m0.005s
sys     0m1.111s
[root@tmr6s5 ~]# taskset -pc 1-2 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m0.931s
user    0m0.006s
sys     0m0.923s

Yes, 20% slower.

What About Oracle?
So, like I said, I mounted that ORACLE_HOME on this Opteron server. What does an AMM instance look like? Here goes:

SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3587 MB
node 1 size: 4095 MB
node 1 free: 3956 MB
SQL> startup pfile=./amm.ora
ORACLE instance started.

Total System Global Area 2276634624 bytes
Fixed Size                  1300068 bytes
Variable Size             570427804 bytes
Database Buffers         1694498816 bytes
Redo Buffers               10407936 bytes
Database mounted.
Database opened.
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 1331 MB
node 1 size: 4095 MB
node 1 free: 3951 MB

Ick. This means that Oracle11g AMM on Opteron servers is a Cyclops. Odd how this allocation came from memory attached to socket 0 when the file creation with dd(1) landed in socket 1’s memory. Hmm…

What to do? SUMA? Well, it seems as though I should be able to interleave tmpfs memory and use that for /dev/shm, at least according to the tmpfs documentation. And should is the operative word. I have been tweaking for a half hour to get the mpol=interleave mount option (with and without the -o remount technique) to work, to no avail. Bummer!
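
For the record, this is the sort of remount I was attempting, per the tmpfs documentation (it did not take effect for me, so treat this purely as a sketch):

# mount -o remount,mpol=interleave /dev/shm
# mount | grep /dev/shm     # check whether the mpol option stuck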

Impact
If AMD can’t get the Barcelona and/or Budapest Quad-core off the ground (and into high-quality servers from HP/IBM/DELL/Verari), none of this will matter. Actually, come to think of it, unless Barcelona is really, really fast, you won’t be sticking it into your existing Socket F motherboards because that doubles your Oracle license fee (unless you are on standard edition which is priced on socket count). That leaves AMD Quad-core adopters waiting for HyperTransport 3.0 as a remedy. I blogged all this AMD Barcelona stuff already.

Given the NUMA characteristics of /dev/shm, I think I’ll test AMM versus MMM on NUMA, and then test again on SUMA, if I can find the time.

If anyone can get /dev/shm mounted with the mpol option, please let me know because, at times, I can be quite a dolt and I’d love this to be one of them.

Oracle on Opteron with Linux-The NUMA Angle (Part VI). Introducing Cyclops.

This is part 6 in a series about Oracle on Opteron-based NUMA servers running Linux. The list of prior installments can be found through my index of NUMA-related posts.

In part 5 of the series I discussed using Opteron-based servers with NUMA features disabled in the BIOS. Running an Opteron server (e.g., HP Proliant DL585) in this fashion is sometimes called SUMA (Sufficiently Uniform Memory Access) or SUMO (Sufficiently Uniform Memory Organization). At the risk of being controversial, I pointed out that in the Oracle Validated Configuration listing for Proliant, the recommendation is given to configure Opteron-based servers as SUMO/SUMA. In my experience, most folks do not change the BIOS and are therefore running a NUMA system since that is the default. However, if steps are taken to disable NUMA on an Opteron system, there are subtleties that warrant deeper understanding. How subtle are the subtleties? That question is the main theme of this blog series.

Memory Latencies with SUMA/SUMO vs NUMA
In part 5 of the series, I used the SLB memory latency workload to show how memory writes differ in NUMA versus SUMA/SUMO. I wrote:

Writing memory on the SUMA configuration in the 8 concurrent memhammer case demonstrated latencies on order of 156ns but dropped 38% to 97ns by switching to NUMA and using the Linux 2.6 NUMA API.

But What About Oracle?
What is the cost of running Oracle on SUMA? The simple answer is, it depends. More architectural background is needed before I go into that.

SUMA, NUMA and CYCLOPS
OK, so SUMA is what you get when you tweak a Proliant Opteron-based server so that memory is interleaved at the low level. Accompanying this with the setting of numa=off in the grub.conf file gets you a completely non-NUMA setup.
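
For reference, a sketch of the grub.conf kernel line I’m referring to; the kernel image and root device are placeholders:

kernel /vmlinuz-<version> ro root=<root-device> numa=off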

Cyclops
NUMA enabled in the BIOS, however, is the default. If the Oracle ports to Linux were NUMA-aware, that would be just fine. However, if the server isn’t configured as a SUMA and you boot Oracle without any consideration for the fact that you are on a NUMA system, you get what I call Cyclops. Let’s take a look at what I mean.

In the following screen shot I have booted an Oracle10g SGA of 7584MB on my Proliant DL585. The system is configured with 32GB physical memory which is, of course, 4 banks of 8GB each attached to one of the 4 dual-core Opterons (nodes). Before booting this SGA, I had between roughly 7.6GB and 7.7GB free memory on each of the memory banks. In the following figure it’s clear that after booting this 7584MB SGA, all but 116MB of node 0’s memory (socket 0) has been consumed—Cyclops!


[Screen shot: cyclops1]

Right, so really bad things can happen if processes that are memory-resident on node 0 try to allocate more memory. In the 2.4 kernel timeframe Red Hat pointed out such ill effects as OOM process termination in this web page. I haven’t spent much time researching how 2.6 responds to it because the point of this blog entry is to not get into such a situation.

Let’s consider what things are like on a Cyclops even if there are no process or memory allocation failures. Let’s say, for instance, there is a listener with soft node affinity to node 2. All the sessions it forks off will have node affinity to node 2, where they will be granted pages for their kernel structures, page tables, stack, heap and so on. However, the entire SGA is remote memory since, as you can see, all the memory for the SGA was allocated from node 0. That is, um, not good.

Hugepages Are More Attractive Than Cyclops
Cyclops pops up its ugly single-eyed head only when you are running NUMA (not SUMA/SOMA) and fail to allocate/use hugepages. Whether you allocate hugepages off the grub boot line or out of sysctl.conf, memory for hugepages is allocated in a distributed fashion from the varying memory banks. Did I say round-robin? No. Because I don’t yet know whether it is round-robin or segmented. I have to leave something to blog about in the future.
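
Both forms are sketched below using the 3800-page value from this example; the kernel image and root device are placeholders:

# grub.conf kernel line (allocates at boot):
#   kernel /vmlinuz-<version> ro root=<root-device> hugepages=3800
#
# /etc/sysctl.conf (applied with sysctl -p):
#   vm.nr_hugepages = 3800
#
# or at runtime, as in the session below:
# echo 3800 > /proc/sys/vm/nr_hugepages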

The following is a screen shot of a session where I allocated 3800 2MB hugepages after the system was booted by echoing that value into /proc/sys/vm/nr_hugepages. Notice that unlike Cyclops, the pages are allocated for Oracle’s future use in a more distributed fashion from the various memory banks. I then booted Oracle. No Cyclops here.

[Screen shot: hugepages]

Interleaving NUMA Memory Allocation
The numactl(8) command supports the notion of pushing memory allocation preferences down to its children. Until such time as the Linux port of Oracle is NUMA-aware internally—as was done in the Sequent DYNIX/ptx, SGI, DG, and to a lesser degree the Solaris Oracle10g port with MPO—the best hope for efficient memory usage on a commodity NUMA system is to interleave the placement of shared memory via numactl(8). With the SGA allocated in this fashion on a 4-socket NUMA system, Oracle’s memory accesses for the variable and buffer pool components will have locality of up to 25%, generally speaking. Yes, I’m sure some session could go crazy with logical reads of 2 buffers 20,000 times per second or some pathological situation, but I am trying to cover the topic in more general terms. You might wonder how this differs from SUMA/SOMA though.
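
A sketch of what that looks like in practice, assuming the instance is started from a shell wrapped by numactl(8) so the interleave policy is inherited by the background processes (the pfile name is hypothetical):

$ numactl --interleave=all sqlplus '/ as sysdba'
SQL> startup pfile=./numa.ora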

With SUMA, all memory is interleaved. That means even the NUMA-aware Linux 2.6 kernel cannot exploit the hardware architecture by allocating structures with respect to the memory hierarchies. That is a pure waste. Moreover, with SUMA, 100% of your Oracle memory accesses will hit interleaved memory. That includes PGA. In contrast, properly allocated NUMA-interleaved hugepages result in fairness in the SGA placement, while allocations in the PGA (heap) and stack for the sessions are 100% local memory! That is a good thing. In the following screen shot I coupled numactl(8) memory interleaving with hugepages.

[Screen shot: interleave]

Validated Oracle Configuration
As I pointed out, this Oracle Validated Configuration listing for Proliant recommends turning off NUMA. Now that I’m an HP employee, I’ll have to pursue that a bit because I don’t agree with it at all. You’ll see why when I post my performance measurements contrasting NUMA (with interleaved hugepages) to SUMA/SOMA. Look at that Validated Configuration web page closely and you’ll see a recommendation to allow Oracle to use hugepages by tuning /etc/security/limits.conf, but no allocation of hugepages from either the grub boot line or the sysctl.conf file!

Could it be that the recommendations in this Validated Configuration were a knee-jerk reaction to Cyclops? I’m not much of a betting man, but I’d wager $5.00 that was the case. Like I said, I’m in HP now…I’ll have to see what all that was about.

Up Next
In my next installment, I will provide Oracle measurements contrasting SUMA and NUMA. I know I’ve said this would be the installment with Oracle performance numbers, but I had to lay too much groundwork in this post. The mind can only absorb what the seat can endure.

Patent Infringement
For all you folks that hate the concept of software patents, here’s a good one. When my Sequent colleagues and I were working out the OS requirements to support our NUMA optimizations of the Oracle 8 port to Sequent’s NUMA-Q system, we knew early on we’d need a very rich set of enhancements to shmget() for memory region placement. So we specified the requirements to our OS developers. Lo and behold, U.S. Patent 6,505,286 plopped out. So, for extra credit, can someone explain to me how the Linux 2.6 libnuma call numa_alloc_onnode() (described here) is not in complete violation of that patent? Hmmm…

Now for a real taste of NUMA-Oracle history, read the following: Sequent_NUMA_Oracle8i

Oracle on Virtual Machines. Going Fishing? Intel “Nehalem” Xeon Quad-Core with CSI Floats!

CRN.com has coverage of the Xeon “Penryn” processor and some info about the micro-architecture change that will follow in 2008 with the “Nehalem” processor. I think the following is an astounding comment:

Meanwhile, Intel is also preparing its next-generation Nehalem platform, which represents the company’s most significant shift in system architecture since the Pentium Pro debuted in 1996, Gelsinger said.

If you remember the P6 Orion chipset with the Pentium Pro, you’ll recall that it was Intel’s first MCM with 4 Pentium processors. It offered 48 bit memory support (kernel address space), 3 cycle shared L2 cache, and was quite the leap over the Pentium. The article states that the off-chip memory controller will be gone (good) and the interconnect (CSI) will be more like AMD HyperTransport. I think that means a bit of a NUMA feel, but I’m not sure yet. The architecture of Nehalem will support up to 8 cores as well.

What Does This Have To Do With Oracle
These are quad core processors that are going to pack a very significant punch—much more so than the AMD Barcelona processor expected later this year. That means single socket, quad core servers with more power than most 4 socket systems today. So if you have, say, a Proliant DL585 (great box) with idle cycles, you will likely have a lot of idle cycles when you refresh with these servers. That means virtualization—get used to it. The article hints towards 32nm processors in the 2010 timeframe. My oh my.

Where and What is a Nehalem, Really?
It is a North American Indian tribe. There is also a river about 40 miles from where I live and it is, in fact, precisely what Intel named this processor after. Intel has named other processors after rivers in the Pacific Northwest region of the states in the past (e.g., Willamette). I’ve been fishing the Nehalem for many, many years. I’m told blogs are better with photos, so here goes.

I’m sure the concept of fishing will wound the tender sentiments of at least a few readers. I’m sorry. You can’t make everyone happy, but I’ll throw a bone. The main species we fish for in the Nehalem is Steelhead, which is an anadromous salmonid related to trout. Basically, it is a trout that lives in salt water but spawns in fresh water. Unlike true salmon, it can repeat that cycle. For that reason, game management in my home state enforces a great deal of “catch and release” and artificial-bait regulations. That is in fact what I was doing when I caught the “Nehalem Bright”, as they are called, in the following photo. Caught, photographed and placed gently back into the water.

[Photo: Nehalem_Brite]

Learn Danish Before You Learn About NUMA

I can’t speak Danish, but I have the next best thing—a Danish friend that speaks English. The Danish arm of Computer Reseller News has a video of Mogens Norgaard (founder of the OakTable Network of which I am glad to be a member). I have no idea whatsoever about what he is discussing, but since the video starts out with him pouring a beer I’m sure I’m missing out on something. No, hold it, I did get something. Featured prominently behind him is a well-used copy of my friend James Morle’s book Scaling Oracle8i.

By the way, if you want to be an Oracle expert, that book should be considered mandatory reading. I don’t care if it is based on Oracle8i, it is still rich with correct information. Also, if you are following my series on NUMA/Oracle, I particularly recommend section 8.1.2 which I contributed to this book. It covers the original NUMA port of Oracle—Sequent. Of particular interest should be the section on one of my only claims to fame: Quad-Local Buffer Preference.

I can’t recall, but perhaps that was the topic James and I were discussing in this photo Alex Gorbachev took at one of our pub stops during UKOUG 2006. Or, maybe we (James and I to the right in the photo) were discussing the guys to our right (Mogens and Thomas Presslie) who were wearing skirts—ur, uh, I mean kilts! I do recall that 5AM came early that morning. Not the best way to start my trip home.


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.


Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.