Archive for the 'NUMA Oracle' Category

Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn’t It? Part – I.

Allocating hugepages for Oracle Database on Linux can be tricky. The following is a short list of some of the common problems associated with faulty attempts to get things properly configured:

  1. Insufficient Hugepages.You can be short just a single 2MB hugepage at instance startup and Oracle will silently fall back to no hugepages. For instance, if an instance needs 10,000 hugepages but there are only 9,999 available at startup Oracle will create non-hugepages IPC shared memory and the 9,999 (x 2MB) is just wasted memory.
    1. Insufficient hugepages is an even more difficult situation when booting with _enable_NUMA_support=TRUE as partial hugepages backing is possible.
  2. Improper Permissions. Both limits.conf(5) memlock and the shell ulimit –l must accommodate the desired amount of locked memory.

In general, list item 1 above has historically been the most difficult to deal with—especially on systems hosting several instances of Oracle. Since there is no way to determine whether an existing segment of shared memory is backed with hugepages, diagnostics are in short supply. Oracle Database 11g Release 2 (11.2.0.2) The fix for Oracle bugs 9195408 (unpublished) and 9931916 (published) is available in 11.2.0.2. In a sort of fast forward to the past, the Linux port now supports an initialization parameter to force the instance to use hugepages for all segments or fail to boot. I recall initialization parameters on Unix ports back in the early 1990s that did just that. The initialization parameter is called use_large_pages and setting it to “only” results in the all or none scenario. This, by the way, addresses list item 1.1 above. That is, setting use_large_pages=only ensures an instance will not have some NUMA segments backed with hugepages and others without. Consider the following example. Here we see that use_large_pages is set to “only” and yet the system has only a very small number of hugepages allocated (800 == ~1.6GB). First I’ll boot the instance using an init.ora file that does not force hugepages and then move on to using the one that does. Note, this is 11.2.0.2.

$ sqlplus '/ as sysdba'

SQL*Plus: Release 11.2.0.2.0 Production on Tue Sep 28 08:10:36 2010

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to an idle instance.

SQL>
SQL> !grep -i huge /proc/meminfo
HugePages_Total:   800
HugePages_Free:    800
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
SQL>
SQL> !grep large_pages y.ora x.ora
use_large_pages=only
SQL>
SQL> startup force pfile=./x.ora
ORACLE instance started.

Total System Global Area 4.4363E+10 bytes
Fixed Size                  2242440 bytes
Variable`Size            1406199928 bytes
Database Buffers         4.2950E+10 bytes
Redo Buffers                4427776 bytes
Database mounted.
Database opened.
SQL> HOST date
Tue Sep 28 08:13:23 PDT 2010

SQL>  startup force pfile=./y.ora
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory

The user feedback is a trite ORA-27102. So the question is,  which memory cannot be allocated? Let’s take a look at the alert log:

Tue Sep 28 08:16:05 2010
Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 800 free: 800)
DFLT Huge Pages allocation successful (allocated: 512)
Huge Pages allocation failed (free: 288 required: 10432)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (1) (allocated: 3)
Huge Pages allocation failed (free: 285 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 285 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (1) (allocated: 192)
NUMA Huge Pages allocation on node (1) (allocated: 64)

That is good diagnostic information. It informs us that the variable portion of the SGA was successfully allocated and backed with hugepages. It just so happens that my variable SGA component is precisely sized to 1GB. That much is simple to understand. After creating the segment for the variable SGA component Oracle moves on to create the NUMA buffer pool segments. This is a 2-socket Nehalem EP system and Oracle allocates from the Nth NUMA node and works back to node 0. In this case the first buffer pool creation attempt is for node 1 (socket 1). However, there were insufficient hugepages as indicated in the alert log. In the following example I allocated  another arbitrarily insufficient number of hugepages and tried to start an instance with use_large_pages=only. This particular insufficient hugepages scenario allows us to see more interesting diagnostics:

SQL>  !grep -i huge /proc/meminfo
HugePages_Total: 12000
HugePages_Free:  12000
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

SQL> startup force pfile=./y.ora
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory

…and, the alert log:

Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 12000 free: 12000)
DFLT Huge Pages allocation successful (allocated: 512)
NUMA Huge Pages allocation on node (1) (allocated: 10432)
Huge Pages allocation failed (free: 1056 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 1056 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 1056 required: 5184)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (0) (allocated: 704)
NUMA Huge Pages allocation on node (0) (allocated: 320)

In this example we see 12,000 hugepages was sufficient to back the variable SGA component and only 1 of the NUMA buffer pools (remember this is Nehalem EP with OS boot string numa=on).

Summary

In my opinion, this is a must-set parameter if you need hugepages. With initialization parameters like use_large_pages, configuring hugepages for Oracle Database is getting a lot simpler.

Next In Series

  1. “[…] if you need hugepages”
  2. More on hugepages and NUMA
  3. Any pitfalls I find.

More Hugepages Articles

Link to Part II in this series: Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn’t It? Part – II. Link to Part III in this series: Configuring Linux Hugepages for Oracle Database is Just Too Difficult! Isn’t It? Part – III. And more: Quantifying hugepages Memory Savings with Oracle Database 11g Little Things Doth Crabby Make – Part X. Posts About Linux Hugepages Makes Some Crabby It Seems. Also, Words About Sizing Hugepages. Little Things Doth Crabby Make – Part IX. Sometimes You Have To Really, Really Want Your Hugepages Support For Oracle Database 11g. Little Things Doth Crabby Make – Part VIII. Hugepage Support for Oracle Database 11g Sometimes Means Using The ipcrm Command. Ugh. Oracle Database 11g Automatic Memory Management – Part I.

Linux Thinks It’s a CPU, But What Is It Really – Part III. How Do Intel Xeon 7500 (Nehalem EX) Processors Map To Linux OS Processors?

Last year I posted a blog entry entitled Linux Thinks It’s a CPU, But What Is It Really – Part I. Mapping Xeon 5500 (Nehalem) Processor Threads to Linux OS CPUs where I discussed the Intel CPU Topology Tool. The topology tool is most helpful when trying to quickly map Linux OS processors to physical processor cores or threads. That post gets read, on average, close to 20 times per day since it was posted (10,000+views) so I thought it deserves a follow-up pertaining to more recent Intel processors and, more importantly, more recent Linux releases.

I’m happy to point out that the tool still functions just fine for Intel Xeon 7500 series processors (a.k.a. Nehalem EX see also Sun Oracle’s Sun Fire X4800), however, with recent Linux releases the tool is not quite as necessary. With both Enterprise Linux Enterprise Linux Server release 5.5 (Oracle Enterprise Linux 5.5 ) and Red Hat Enterprise Linux Server release 5.5 the numactl(8) command now renders output that makes it quite clear which sockets associate with which OS processors.

The following output was captured from an 8-socket Nehalem EX machine:

$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 131062 MB
node 0 free: 122879 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 131072 MB
node 1 free: 125546 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 131072 MB
node 2 free: 125312 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 131072 MB
node 3 free: 126543 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 131072 MB
node 4 free: 125454 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 131072 MB
node 5 free: 124881 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 131072 MB
node 6 free: 123862 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 131072 MB
node 7 free: 126054 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  15  20  15  15  20  20  20 
  1:  15  10  15  20  20  15  20  20 
  2:  20  15  10  15  20  20  15  20 
  3:  15  20  15  10  20  20  20  15 
  4:  15  20  20  20  10  15  15  20 
  5:  20  15  20  20  15  10  20  15 
  6:  20  20  15  20  15  20  10  15 
  7:  20  20  20  15  20  15  15  10 

A node is synonymous with a socket in this case. So, as the output shows, socket 0 maps to OS processors 0-7 and 64-71, the latter range being processor threads. Let’s see how similar this output is to the Intel CPU Topology Tool (NOTE – hover over the box and click view source for best presentation):

Package 0 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
L1D is Level 1 Data cache, size(KBytes)= 32,  Cores/cache= 2, Caches/package= 8
L1I is Level 1 Instruction cache, size(KBytes)= 32,  Cores/cache= 2, Caches/package= 8
L2 is Level 2 Unified cache, size(KBytes)= 256,  Cores/cache= 2, Caches/package= 8
L3 is Level 3 Unified cache, size(KBytes)= 24576,  Cores/cache= 16, Caches/package= 1
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|       0       64|       1       65|       2       66|       3       67|       4       68|       5       69|       6       70|       7       71|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|       1     1z16|       2     2z16|       4     4z16|       8     8z16|      10     1z17|      20     2z17|      40     4z17|      80     8z17|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ff


Package 1 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|       8       72|       9       73|      10       74|      11       75|      12       76|      13       77|      14       78|      15       79|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     100     1z18|     200     2z18|     400     4z18|     800     8z18|     1z3     1z19|     2z3     2z19|     4z3     4z19|     8z3     8z19|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ff00


Package 2 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      16       80|      17       81|      18       82|      19       83|      20       84|      21       85|      22       86|      23       87|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     1z4     1z20|     2z4     2z20|     4z4     4z20|     8z4     8z20|     1z5     1z21|     2z5     2z21|     4z5     4z21|     8z5     8z21|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz4


Package 3 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      24       88|      25       89|      26       90|      27       91|      28       92|      29       93|      30       94|      31       95|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     1z6     1z22|     2z6     2z22|     4z6     4z22|     8z6     8z22|     1z7     1z23|     2z7     2z23|     4z7     4z23|     8z7     8z23|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz6


Package 4 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      32       96|      33       97|      34       98|      35       99|      36      100|      37      101|      38      102|      39      103|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     1z8     1z24|     2z8     2z24|     4z8     4z24|     8z8     8z24|     1z9     1z25|     2z9     2z25|     4z9     4z25|     8z9     8z25|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz8


Package 5 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      40      104|      41      105|      42      106|      43      107|      44      108|      45      109|      46      110|      47      111|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|    1z10     1z26|    2z10     2z26|    4z10     4z26|    8z10     8z26|    1z11     1z27|    2z11     2z27|    4z11     4z27|    8z11     8z27|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz10


Package 6 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      48      112|      49      113|      50      114|      51      115|      52      116|      53      117|      54      118|      55      119|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|    1z12     1z28|    2z12     2z28|    4z12     4z28|    8z12     8z28|    1z13     1z29|    2z13     2z29|    4z13     4z29|    8z13     8z29|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz12


Package 7 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      56      120|      57      121|      58      122|      59      123|      60      124|      61      125|      62      126|      63      127|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|    1z14     1z30|    2z14     2z30|    4z14     4z30|    8z14     8z30|    1z15     1z31|    2z15     2z31|    4z15     4z31|    8z15     8z31|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

I’m quite happy to see this enhancement to numactl(8). I’ll try to blog soon on why you should care about this topic.

You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part I.

In May 2009 I made a blog entry entitled You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part II. There had not yet been a Part I but as I pointed out in that post I would loop back and make Part I. Here it is. Better late than never.

Background
I originally planned to use Part I to stroll down memory lane (back to 1995) with a story about the then VP of Oracle RDBMS Development’s initial impression about the Sequent DYNIX/ptx NUMA API during a session where we presented it and how it would be beneficial to code to NUMA APIs sooner rather than later. We were mixing vision with the specific need of our port to be honest.

We were the first to have a production NUMA API to which Oracle could port and we were quite a bit sooner to the whole NUMA trend than anyone else. Our’s was the first production NUMA system.

Now, this VP is no longer at Oracle but the  (redacted) response was, “Why would we want to use any of this ^#$%.”  We (me and the three others presenting the API) were caught off guard. However, we all knew that the question was a really good question. There were still good companies making really tight, high-end SMPs with uniform memory.  Just because we (Sequent) had to move into NUMA architecture didn’t mean we were blind to the reality around us. However, one thing we knew for sure—all systems in the future would have NUMA attributes of varying levels. All our competition was either in varying stages of denial or doing what I like to refer to as “Poo-pooh it while you do it.” All the major players eventually came out with NUMA systems.  Some sooner, some later and the others died trying.

That takes us to Commodity NUMA and the new purpose of this “Part I” post.

Before I say a word about this Part I I’d like to point out that the concepts in Part II are of a “must-know” variety unless you relinquish your computing power to some sort of hosted facility where you don’t have the luxury of caring about the architecture upon which you run Oracle Database.

Part II was about the different types of NUMA (historical and present) and such knowledge will help you if you find yourself in a troubling performance situation that relates to NUMA. NUMA is commodity, as I point out, and we have to come to grips with that.

What Is He Blogging About?
The current state of commodity NUMA is very peculiar. These Commodity NUMA Implementations (CNI) systems are so tightly coupled that most folks don’t even realize they are running on a NUMA system. In fact, let me go out on a ledge. I assert that nobody is configuring Oracle Database 11g Release 2 with NUMA optimizations in spite of the fact that they are on a NUMA box (e.g., Nehalem EP, AMD Opterton). The reason I believe this is because the init.ora parameter to invoke Oracle NUMA awareness changed names from 11gR1 to 11gR2 as per My Oracle Support note 864633.1. The parameter changed from _enable_NUMA_optimization to enable_NUMA_support. I know nobody is setting this because if they had I can almost guarantee they would have googled for problems. Allow me to explain.

If Nobody is Googling It, Nobody is Doing It
Anyone who tests _enable_NUMA_support as per My Oracle Support note 864633.1 will likely experience the sorts of problems that I detail later in this post. But first, let’s see what they would get from google when they search for _enable_NUMA_support:

Yes, just as I thought…Google found nothing. But what is my point? My point is two-fold. First, I happen to know that Nehalem EP  with QPI and Opteron with AMD HyperTransport are such good technologies that you really don’t have to care that much about NUMA software optimizations. At least to this point of the game. Reading M.O.S note 1053332.1 (regards disabling Linux NUMA support for Oracle Database Machine hosts) sort of drives that point home. However, saying you don’t need to care about NUMA doesn’t mean you shouldn’t experiment. How can anyone say that setting _enable_NUMA_support is a total placebo in all cases? One can’t prove a negative.

If you dare, trust me when I say that an understanding of NUMA will be as essential in the next 10 years as understanding SMP (parallelism and concurrency) was in the last 20 years. OK, off my soapbox.

Some Lessons in Enabling Oracle NUMA Optimizations with Oracle Database 11g Release 2
This section of the blog aims to point out that even when you think you might have tested Oracle NUMA optimizations there is a chance you didn’t. You have to know the way to ensure you have NUMA optimizations in play. Why? Well, if the configuration is not right for enabling NUMA features, Oracle Database will simply ignore you. Consider the following session where I demonstrate the following:

  1. Evidence that I am on a NUMA system (numactl(8))
  2. I started up an instance with a pfile (p4.ora) that has _enable_NUMA_support set to TRUE
  3. The instance started but _enable_NUMA_support was forced back to FALSE

Note, in spite of event #3, the alert log will not report anything to you about what went wrong.

SQL>
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 36317 MB
node 0 free: 31761 MB
node 1 size: 36360 MB
node 1 free: 35425 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

SQL> startup pfile=./p4.ora
ORACLE instance started.

Total System Global Area 5746786304 bytes
Fixed Size                  2213216 bytes
Variable Size            1207962272 bytes
Database Buffers         4294967296 bytes
Redo Buffers              241643520 bytes
Database mounted.
Database opened.
SQL> show parameter _enable_NUMA_support

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_enable_NUMA_support                 boolean     FALSE

SQL>
SQL> !grep _enable_NUMA_support ./p4.ora
_enable_NUMA_support=TRUE

OK, so the instance is up and the parameter was reverted, what does the IPC shared memory segment look like?

SQL> !ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0xed304ac0 229380     oracle    660        4096       0
0x7393f7f4 1179653    oracle    660        5773459456 35
0x00000000 393223     oracle    644        790528     5          dest
0x00000000 425992     oracle    644        790528     5          dest
0x00000000 458761     oracle    644        790528     5          dest

Right, so I have no NUMA placement of the buffer pool. On Linux, Oracle must create multiple segments and allocate them on specific NUMA nodes (memory hierarchies). It was a little simpler for the first NUMA-aware port of Oracle (Sequent) since the APIs allowed for the creation of a single shared memory segment with regions of the segment placed onto different memories. Ho Hum.

What Went Wrong
Oracle could not find the libnuma.so it wanted to link with dlopen():

$ grep libnuma /tmp/strace.out | grep ENOENT | head
14626 open("/usr/lib64/libnuma.so", O_RDONLY) = -1 ENOENT (No such file or directory)
14627 open("/usr/lib64/libnuma.so", O_RDONLY) = -1 ENOENT (No such file or directory)

So I create the necessary symbolic link and subsequently boot the instance and inspect the shared memory segments. Here I see that I have a ~1GB segment for the variable SGA components and my buffer pool has been segmented into two roughly 2.3 GB segments.

# ls -l /usr/*64*/*numa*
lrwxrwxrwx 1 root root    23 Mar 17 09:25 /usr/lib64/libnuma.so -> /usr/lib64/libnuma.so.1
-rwxr-xr-x 1 root root 21752 Jul  7  2009 /usr/lib64/libnuma.so.1

SQL> show parameter db_cache_size

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_cache_size                        big integer 4G
SQL> show parameter NUMA_support

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_enable_NUMA_support                 boolean     TRUE
SQL> !ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0xed304ac0 229380     oracle    660        4096       0
0x00000000 2719749    oracle    660        1006632960 35
0x00000000 2752518    oracle    660        2483027968 35
0x00000000 393223     oracle    644        790528     6          dest
0x00000000 425992     oracle    644        790528     6          dest
0x00000000 458761     oracle    644        790528     6          dest
0x00000000 2785290    oracle    660        2281701376 35
0x7393f7f4 2818059    oracle    660        2097152    35

So there I have an SGA successfully created with _enable_NUMA_support set to TRUE. But, what strings appear in the alert log? Well, I’ll blog that soon because it leads me to other content.

Oracle Database 10g 10.2.0.4 on Linux. A NUMA Fix?

I am not a DBA, but that doesn’t mean I lack respect for how difficult planned upgrades can be in complex Enterprise deployments. If I was a DBA planning an upgrade of my Oracle Database 10g Release 2 database I would be looking forward to 10.2.0.4. From what I’ve seen it looks like a pretty substantial release. Metalink 401436.1 spells out some of the bugs that could/should be fixed in the 10.2.0.4 release.

Those of you who have read my NUMA-related posts might be interested to know that Metalink 401436.1 lists bug 5173642 as once of the bugs that could/should be fixed in 10.2.0.4. That bug was essentially a deadline on NUMA optimizations in 10gR2 as is clear by the specified workaround:

Workaround: Do not use NUMA optimization. eg: Set the following init.ora parameters: _enable_numa_optimization=FALSE
_db_block_numa=1

Oracle11g Automatic Memory Management – Part III. A NUMA Issue.

Now I’m glad I did that series about Oracle on Linux, The NUMA Angle. In my post about the the difference between NUMA and SUMA and “Cyclops”, I shared a lot of information about the dynamics of Oracle running with all the SGA allocated from one memory bank on a NUMA system. Déjà vu.

Well, we’re at it again. As I point out in Part I and Part II of this series, Oracle implements Automatic Memory Management in Oracle Database 11g with memory mapped files in /dev/shm. That got me curious.

Since I exclusively install my Oracle bits on NFS mounts, I thought I’d sling my 11g ORACLE_HOME over to a DL385 I have available in my lab setup. Oh boy am I going to miss that lab when I take on my new job September 4th. Sob, sob. See, when you install Oracle on NFS mounts, the installation is portable. I install 32bit Linux ports via 32bit server into an NFS mount and I can take it anywhere. In fact, since the database is on an NFS mount (HP EFS Clustered Gateway NAS) I can take ORACLE_HOME and the database mounts to any system with a RHEL4 OS running-and that includes RHEL4 x86_64 servers even though the ORACLE_HOME is 32bit. That works fine, except 32bit Oracle cannot use libaio on 64bit RHEL4 (unless you invokde everything under the linux32 command environment that is). I don’t care about that since I use either Oracle Disk Manager or, better yet, Oracle11g Direct NFS. Note, running 32bit Oracle on a 64bit Linux OS is not supported for production, but for my case it helps me check certain things out. That brings us back to /dev/shm on AMD Opteron (NUMA) systems. It turns out the only Opteron system I could test 11g AMM on happens to have x86_64 RHEL4 installed-but, again, no matter.

Quick Test

[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 3955 MB
[root@tmr6s5 ~]# dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024
1024+0 records in
1024+0 records out
[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 2927 MB

Uh, that’s not good. I dumped some zeros into a file on /dev/shm and all the memory was allocated from socket 1. Lest anyone forget from my NUMA series (you did read that didn’t you?), writing memory not connected to your processor is, uh, slower:

[root@tmr6s5 ~]# taskset -pc 0-1 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 0,1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m1.116s
user    0m0.005s
sys     0m1.111s
[root@tmr6s5 ~]# taskset -pc 1-2 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m0.931s
user    0m0.006s
sys     0m0.923s

Yes, 20% slower.

What About Oracle?
So, like I said, I mounted that ORACLE_HOME on this Opteron server. What does an AMM instance look like? Here goes:

SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3587 MB
node 1 size: 4095 MB
node 1 free: 3956 MB
SQL> startup pfile=./amm.ora
ORACLE instance started.

Total System Global Area 2276634624 bytes
Fixed Size                  1300068 bytes
Variable Size             570427804 bytes
Database Buffers         1694498816 bytes
Redo Buffers               10407936 bytes
Database mounted.
Database opened.
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 1331 MB
node 1 size: 4095 MB
node 1 free: 3951 MB

Ick. This means that Oracle11g AMM on Opteron servers is a Cyclops. Odd how this allocation came from memory attached to socket 0 when the file creation with dd(1) landed in socket 1’s memory. Hmm…

What to do? SUMA? Well, it seems as though I should be able to interleave tmpfs memory and use that for /dev/shm-at least according to the tmpfs documentation. And should is the operative word. I have been tweaking for a half hour to get the mpol=interleave mount option (with and without the -o remount technique) to no avail. Bummer!

Impact
If AMD can’t get the Barcelona and/or Budapest Quad-core off the ground (and into high-quality servers from HP/IBM/DELL/Verari), none of this will matter. Actually, come to think of it, unless Barcelona is really, really fast, you won’t be sticking it into your existing Socket F motherboards because that doubles your Oracle license fee (unless you are on standard edition which is priced on socket count). That leaves AMD Quad-core adopters waiting for HyperTransport 3.0 as a remedy. I blogged all this AMD Barcelona stuff already.

Given the NUMA characteristics of /dev/shm, I think I’ll test AMM versus MMM on NUMA, and them test again on SUMA-if I can find the time.

If anyone can get /dev/shm mounted with the mpol option, please let me know because, at times, I can be quite a dolt and I’d love this to be one of them.

Oracle on Opteron with Linux-The NUMA Angle (Part VII).

This installment in my series about Oracle on Linux with NUMA hardware is very, very late. I started this series at the end of last year and it just kept getting put off—mostly because the hardware I needed to use was being used for other projects (my own projects). This is the seventh in the series and it’s time to show some Oracle numbers. Previously, I laid groundwork about such topics as SUMA/NUMA, NUMA API and so forth. To make those points I relied on microbenchmarks such as the Silly Little Benchmark. The previous installments can be found here.

To bring home the point that Oracle should be run on AMD boxes in NUMA mode (as opposed to SUMA), I decided to pick an Oracle workload that is very easy to understand as well as processor intensive. After all, the difference between SUMA and NUMA is higher memory latency so testing at any level below processor saturation actually provides the same throughput-albeit the SUMA result would come at a higher processor cost. To that end, measuring SUMA and NUMA at processor saturation is the best way to see the difference.

The workload I’ll use for this testing is what my friend Anjo Kolk refers to as the Jonathan Lewis Oracle Computing Index workload. The workload comes in script form and is very straightforward. The important thing about the workload is that it hammers memory which, of course, is the best way to see the NUMA effect. Jonathan Lewis needs no introduction of course.

The test was set up to execute 4, 8 16 and 32 concurrent invocations of the JL Comp script. The only difference in the test setup was that in one case I booted the server in SUMA mode and in another I booted in NUMA mode and allocated hugepages. As I point out in this post about SUMA, hugepages are allocated in a NUMA fashion and booting an SGA into this memory offers at least crude fairness placement of the SGA pages—certainly much better than a Cyclops. In short, what is being tested here one case where memory is allocated at boot time in a completely round-robin fashion versus the SGA being quasi-round robin yet page tables, kernel-side process-related structures and heap are all NUMA-optimized. Remember, this is no more difficult than a system boot option. Let’s get to the numbers.

jlcomp.jpg

I have also rolled up all the statspack reports into a word document (as required by WordPress). The document is numa-statspack.doc and it consist of 8 statspacks each prefaced by the name of what the specific test was. If you pattern search for REPORT NAME you will see each entry. Since this is a simple memory latency improvement, you might not be surprised how uninteresting the stats are-except of course the vast improvement in the number of logical reads per second the NUMA tests were able to push through the system.

SUMA or NUMA
A picture speaks a thousand words. This simple test combined with this simple graph covers it all pretty well. The job complete time ranged from about 12 to 15 percent better with NUMA at each of the concurrent session counts. While 12 to 15% isn’t astounding, remember this workload is completely processor bound. How do you usually recuperate 12-15% from a totally processor-bound workload without changing even a single line of code? Besides, this is only one workload and the fact remains that the more your particular workload does outside the SGA (e.g., sorting, etc) the more likely you are to see improvement. But by all means, do not run Oracle with Cyclops memory.

The Moral of the Story

Processors are going to get more cores and slower clock rates and memory topologies will look a lot more NUMA than SUMA as time progresses. I think it is important to understand NUMA.

What is Oracle Doing About It?
Well, I’ve blogged about the fact that the Linux ports of 10g do not integrate with libnuma. That means it is not NUMA-aware. What I’ve tried to show in this series is that the world of NUMA is not binary. There is more to it than SUMA or NUMA-aware. In the middle is booting the server and database in a fashion that at least allows benefit from the OS-side NUMA-awareness. The next step is Oracle NUMA-awareness.

Just recently I was sitting in a developer’s office in bldg 400 of Oracle HQ talking about NUMA. It was a good conversation. He stated that Oracle actually has NUMA awareness in it and I said, “I know.” I don’t think Sequent was on his mind and I can’t blame him—that was a long time ago. The vestiges of NUMA awareness in Oracle 10g trace back to the high-end proprietary NUMA implementations of the 1990s.  So if “it’s in there” what’s missing? We both said vgetcpu() at the same time. You see, you can’t have Oracle making runtime decisions about local versus remote memory if a process doesn’t know what CPU it is currently executing on (detection with less than a handful of instructions).  Things like vgetcpu() seem to be coming along. That means once these APIs are fully baked, I think we’ll see Oracle resurrect intrinsic NUMA awareness in the Linux port of Oracle Database akin to those wildcat ports of the late 90s…and that should be a good thing.

Oracle on Opteron with Linux-The NUMA Angle (Part VI). Introducing Cyclops.

This is part 6 in a series about Oracle on Opteron-based NUMA servers running Linux. The list of prior installments can be found through my index of NUMA-related posts.

In part 5 of the series I discussed using Opteron-based servers with NUMA features disabled in the BIOS. Running an Opteron server (e.g., HP Proliant DL585) in this fashion is sometimes called SUMA (Sufficiently Uniform Memory Access) or SUMO (Sufficiently Uniform Memory Organization). At the risk of being controversial, I pointed out that in the Oracle Validated Configuration listing for Proliant, the recommendation is given to configure Opteron-based servers as SUMO/SUMA. In my experience, most folks do not change the BIOS and are therefore running a NUMA system since that is the default. However, if steps are taken to disable NUMA on an Opteron system, there are subtleties that warrant deeper understanding. How subtle are the subtleties? That question is the main theme of this blog series.

Memory Latencies with SUMA/SUMO vs NUMA
In part 5 of the series, I used the SLB memory latency workload to show how memory writes differ in NUMA versus SUMA/SUMO. I wrote:

Writing memory on the SUMA configuration in the 8 concurrent memhammer case demonstrated latencies on order of 156ns but dropped 38% to 97ns by switching to NUMA and using the Linux 2.6 NUMA API.

But What About Oracle?
What is the cost of running Oracle on SUMA? The simple answer is, it depends. More architectural background is needed before I go into that.

SUMA, NUMA and CYCLOPS
OK, so SUMA is what you get when you tweak a Proliant Opteron-based server so that memory is interleaved at the low level. Accompanying this with the setting of numa=off in the grub.conf file gets you a completely non-NUMA setup.

Cyclops
NUMA enabled in the BIOS, however, is the default. If the Oracle ports to Linux were NUMA-aware, that would be just fine. However, if the server isn’t configured as a SUMA and you boot Oracle without any consideration for the fact that you are on a NUMA system, you get what I call Cyclops. Let’s take a look at what I mean.

In the following screen shot I have booted an Oracle10g SGA of 7584MB on my Proliant DL585. The system is configured with 32GB physical memory which is, of course, 4 banks of 8GB each attached to one of the 4 dual-core Opterons (nodes). Before booting this SGA, I had between roughly 7.6GB and 7.7GB free memory on each of the memory banks. In the following figure it’s clear that after booting this 7584MB SGA I am left with all but 116MB of memory consumed from node 0 (socket 0)—Cyclops!

NOTE: You may need to right click->view the image

cyclops1

Right, so really bad things can happen if processes that are memory-resident on node 0 try to allocate more memory. In the 2.4 Kernel timeframe Red Hat points out such ill affect as OOM process termination in this web page. I haven’t spent much time researching how 2.6 responds to it because the point of this blog entry to not get into such a situation.

Let’s consider what things are like on a Cyclops even if there are no process or memory allocation failures. Let’s say, for instance, there is a listener with soft node affinity to node 2. All the sessions it forks off will have node affinity to node 2 where they will be granted pages for their kernel structures, page tables, stack, heap and so on. However, the entire SGA is remote memory since as you can see all the memory for the SGA was allocated from node 0. That is, um, not good.

Hugepages Are More Attractive Than Cyclops
Cyclops pops up its ugly single-eyed head only when you are running NUMA (not SUMA/SOMA) and fail to allocate/use hugepages. Whether you allocate hugepages off the grub boot line or out of sysctl.conf, memory for hugepages is allocated in a distributed fashion from the varying memory banks. Did I say round-robin? No. Because I don’t yet know whether it is round-robin or segmented. I have to leave something to blog about in the future.

The following is a screen shot of a session where I allocated 3800 2MB hugepages after the system was booted by echoing that value into /proc/sys/vm/nr_hugepages. Notice that unlike Cyclops, the pages are allocated for Oracle’s future use in a more distributed fashion from the various memory banks. I then booted Oracle. No Cyclops here.

hugepages

Interleaving NUMA Memory Allocation
The numactl(8) command supports the notion of pushing memory allocation preferences down to its children. Until such time as the Linux port of Oracle is NUMA-aware internally—as was done in the Sequent DYNIX/ptx, SGI, DG, and to a lesser degree the Solaris Oracle10g port with MPO—the best hopes for efficient memory usage on a commodity NUMA system is to interleave the placement of shared memory via numactl(8). With the SGA allocated in this fashion on a 4-socket NUMA system, Oracle’s memory accesses for the variable and buffer pool components will have locality of up to 25%–generally speaking. Yes, I’m sure some session could go crazy with logical reads of 2 buffers 20,000 times per second or some pathological situation, but I am trying to cover the topic in more general terms. You might wonder how this differs from SUMA/SOMA though.

With SUMA, all memory is interleaved. That means even the NUMA-aware Linux 2.6 kernel cannot exploit the hardware architecture by allocating structures with respect to the memory hierarchies. That is a pure waste. Moreover, with SUMA, 100% of your Oracle memory accesses will hit interleaved memory. That includes PGA. In contrast, properly allocated NUMA-interleaved hugepages results in fairness in the SGA placement, but allocation in the PGA (heap) and stack for the sessions are 100% local memory! That is a good thing. In the following screen shot I coupled numactl(8) memory interleaving with hugepages.

interleave

Validated Oracle Configuration
As I pointed out, this Oracle Validated Configuration listing for Proliant recommends turning off NUMA. Now that I’m an HP employee, I’ll have to pursue that a bit because I don’t agree with it at all. You’ll see why when I post my performance measurements contrasting NUMA (with interleave hugepages) to SUMA/SOMA. Look at that Validated Configuration web page closely and you’ll see a recommendation to allow Oracle to use hugepages by tuning /etc/security/limits.conf, but neither allocation of hugepages from the grub boot line nor via the sysctl.conf file!

Could it be that the recommendations in this Validated Configuration were a knee-jerk reaction to Cyclops? I’m not much of a betting man, but I’d wager $5.00 that was the case. Like I said, I’m in HP now…I’ll have to see what all that was about.

Up Next
In my next installment, I will provide Oracle measurements contrasting SUMA and NUMA. I know I’ve said this would be the installment with Oracle performance numbers, but I had to lay too much ground work in this post. The mind can only absorb what the seat can endure.

Patent Infringement
For all you folks that hate the concept of software patents, here’s a good one. When my Sequent colleagues and I were working out the OS-requirements to support our NUMA-optimizations of the Oracle 8 port to Sequent’s NUMA-Q system, we knew early on we’d need a very rich set of enhancements to shmget() for memory region placement. So we specified the requirements to our OS developers. Lo and behold U.S. Patent 6,505,286 plopped out. So, for extra credit, can someone explain to me how the Linux 2.6 libnuma call numa_alloc_onnode() (described here) is not in complete violation of that patent? Hmmm…

Now for a real taste of NUMA-Oracle history, read the following: Sequent_NUMA_Oracle8i

Learn Danish Before You Learn About NUMA

I can’t speak Danish, but I have the next best thing—a Danish friend that speaks English. The Danish arm of Computer Reseller News has a video of Mogens Norgaard (founder of the OakTable Network of which I am glad to be a member). I have no idea whatsoever about what he is discussing, but since the video starts out with him pouring a beer I’m sure I’m missing out on something. No, hold it, I did get something. Featured prominently behind him is a well-used copy of my friend James Morle’s book Scaling Oracle8i.

By the way, if you want to be an Oracle expert, that book should be considered mandatory reading. I don’t care if it is based on Oracle8i, it is still rich with correct information. Also, if you are following my series on NUMA/Oracle, I particularly recommend section 8.1.2 which I contributed to this book. It covers the original NUMA port of Oracle—Sequent. Of particular interest should be the section on one of my only claims to fame: Quad-Local Buffer Preference.

I can’t recall, but perhaps that was the topic James and I were discussing in this photo Alex Gorbachev took at one of our pub stops during UKOUG 2006. Or, maybe we (James and I to the right in the photo) were discussing the guys to our right (Mogens and Thomas Presslie) who were wearing skirts—ur, uh, I mean kilts! I do recall that 5AM came early that morning. Not the best way to start my trip home.

Oracle on Opteron with Linux-The NUMA Angle (Part V). Introducing numactl(8) and SUMA. Is The Oracle x86_64 Linux Port NUMA Aware?

This blog entry is part five in a series. Please visit here for links to the previous installments.

Opteron-Based Servers are NUMA Systems
Or are they? It depends on how you boot them. For instance, I have 2 HP DL585 servers clustered with the PolyServe Database Utility for Oracle RAC. I booted one of the servers as a non-NUMA by tweaking the BIOS so that memory was interleaved on a 4KB basis. This is a memory model HP calls Sufficiently Uniform Memory Access (SUMA) as stated in this DL585 Technology Brief (pg. 6):

Node interleaving (SUMA) breaks memory into 4-KB addressable entities. Addressing starts with address 0 on node 0 and sequentially assigns through address 4095 to node 0, addresses 4096 through 8191 to node 1, addresses 8192 through 12287 to node 3, and addresses 12888

Booting in this fashion essentially turns an HP DL585 into a “flat-memory” SMP—or a SUMA in HP parlance. There seems to be conflicting monikers for using Opteron SMPs in this mode. IBM has a Redbook that covers the varying NUMA offerings in their System x portfolio. The abstract for this Redbook states:

The AMD Opteron implementation is called Sufficiently Uniform Memory Organization (SUMO) and is also a NUMA architecture. In the case of the Opteron, each processor has its own “local” memory with low latency. Every CPU can also access the memory of any other CPU in the system but at longer latency.

Whether it is SUMA or SUMO, the concept is cool, but a bit foreign to me given my NUMA background. The NUMA systems I worked on in the 90s consisted of distinct, separate small systems—each with their own memory and I/O cards, power supplies and so on. They were coupled into a single shared memory image with specialized hardware inserted into the system bus of each little system. These cards were linked together and the whole package was a cache coherent SMP (ccNUMA).

Is SUMA Recommended For Oracle?
Since the HP DL585 can be SUMA/SUMO, I thought I’d give it a test. But first I did a little research to see how most folks use these in the field. I know from the BIOS on my system that you actually get a warning and have to override it when setting up interleaved memory (SUMA). I also noticed that in one of HP’s Oracle Validated Configurations, the following statement is made:

Settings in the server BIOS adjusted to allow memory/node interleaving to work better with the ‘numa=off’ boot option

and:

Boot options
elevator=deadline numa=off

 

I found this to be strange, but I don’t yet fully understand why that recommendation is made. Why did they perform this validation with SUMA? When running a 4-socket Opteron system in SUMA mode, only 25% of all memory accesses will be to local memory. When I say all, I mean all—both user and kernel mode. The Linux 2.6 kernel is NUMA-aware so is seems like a waste to transform a NUMA system into a SUMA system? How can boiling down a NUMA system with interleaving (SUMA) possibly be optimal for Oracle? I will blog about this more as this series continues.

Is the x86_64 Linux Oracle Port NUMA Aware?
No, sorry, it is not. I might as well just come out and say it.

The NUMA API for Linux is very rudimentary compared to the boutique features in legacy NUMA systems like Sequent DYNIX/ptx and SGI IRIX, but it does support memory and process placement. I’ll blog later about this things it is missing that a NUMA aware Oracle port would require.

The Linux 2.6 kernel is NUMA aware, but what is there for applicaitons? The NUMA API which is implemented in the library called libnuma.so. But you don’t have to code to the API to effect NUMA awareness. The major 2.6 Linux kernel distributions (RHEL4 and SLES) ship with a command that uses the NUMA API in ways I’ll show later in this blog entry. The command is numactl(8) and it dynamically links to the NUMA API library (emphasis added by me):

$ uname -a
Linux tmr6s13 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ type numactl
numactl is hashed (/usr/bin/numactl)
$ ldd /usr/bin/numactl
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003ba3200000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003ba2f00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003ba2d00000)

Whereas the numactl(8) command links with libnuma.so, Oracle does not:

$ type oracle
oracle is /u01/app/oracle/product/10.2.0/db_1/bin/oracle
$ ldd /u01/app/oracle/product/10.2.0/db_1/bin/oracle
libskgxp10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libskgxp10.so (0x0000002a95557000)
libhasgen10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libhasgen10.so (0x0000002a9565a000)
libskgxn2.so => /u01/app/oracle/product/10.2.0/db_1/lib/libskgxn2.so (0x0000002a9584d000)
libocr10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libocr10.so (0x0000002a9594f000)
libocrb10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libocrb10.so (0x0000002a95ab4000)
libocrutl10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libocrutl10.so (0x0000002a95bf0000)
libjox10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libjox10.so (0x0000002a95d65000)
libclsra10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libclsra10.so (0x0000002a96830000)
libdbcfg10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libdbcfg10.so (0x0000002a96938000)
libnnz10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libnnz10.so (0x0000002a96a55000)
libaio.so.1 => /usr/lib64/libaio.so.1 (0x0000002a96f15000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003ba3200000)
libm.so.6 => /lib64/tls/libm.so.6 (0x0000003ba3400000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003ba3800000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003ba7300000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003ba2f00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003ba2d00000)

No Big Deal, Right?
This NUMA stuff must just be a farce then, right? Let’s dig in. First, I’ll use the SLB (http://oaktable.net/getFile/148). Later I’ll move on to what fellow OakTable Network member Anjo Kolk and I refer to as the Jonathan Lewis Oracle Computing Index. The JL Oracle Computing Index is yet another microbenchmark that is very easy to run and compare memory throughput from one server to another using an Oracle workload. I’ll use this next to blog about NUMA effects/affects on a running instance of Oracle. After that I’ll move on to more robust Oracle OLTP and DSS workloads. But first, more SLB.

The SLB on SUMA/SOMA
First, let’s use the numactl(8) command to see what this DL585 looks like. Is it NUMA or SUMA?

$ uname -a
Linux tmr6s13 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ numactl –hardware
available: 1 nodes (0-0)
node 0 size: 32767 MB
node 0 free: 30640 MB

OK, this is a single node NUMA—or SUMA since it was booted with memory interleaving on. If it wasn’t for that boot option the command would report memory for all 4 “nodes” (nodes are sockets in the Opteron NUMA world). So, I set up a series of SLB tests as follows:

$ cat example1
echo “One thread”
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./trigger
wait

echo “Two Threads, same core”
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 6

echo “One thread”
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./trigger
wait

echo “Two threads, same socket”
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 6
./memhammer 262144 6000 &
./trigger
wait

echo “Two threads, different sockets”
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./trigger
wait

echo “4 threads, 4 sockets”
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./cpu_bind $$ 3
./memhammer 262144 6000 &
./cpu_bind $$ 1
./memhammer 262144 6000 &
./trigger
wait

echo “8 threads, 4 sockets”
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./cpu_bind $$ 3
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./cpu_bind $$ 1
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./trigger
wait

And now the measurements:

$ sh ./example1
One thread
Total ops 1572864000 Avg nsec/op 71.5 gettimeofday usec 112433955 TPUT ops/sec 13989225.9
Two threads, same socket
Total ops 1572864000 Avg nsec/op 73.4 gettimeofday usec 115428009 TPUT ops/sec 13626363.4
Total ops 1572864000 Avg nsec/op 74.2 gettimeofday usec 116740373 TPUT ops/sec 13473179.5
Two threads, different sockets
Total ops 1572864000 Avg nsec/op 73.0 gettimeofday usec 114759102 TPUT ops/sec 13705788.7
Total ops 1572864000 Avg nsec/op 73.0 gettimeofday usec 114853095 TPUT ops/sec 13694572.2
4 threads, 4 sockets
Total ops 1572864000 Avg nsec/op 78.1 gettimeofday usec 122879394 TPUT ops/sec 12800063.1
Total ops 1572864000 Avg nsec/op 78.1 gettimeofday usec 122820373 TPUT ops/sec 12806214.2
Total ops 1572864000 Avg nsec/op 78.2 gettimeofday usec 123016921 TPUT ops/sec 12785753.3
Total ops 1572864000 Avg nsec/op 78.5 gettimeofday usec 123527864 TPUT ops/sec 12732868.1
8 threads, 4 sockets
Total ops 1572864000 Avg nsec/op 156.3 gettimeofday usec 245773200 TPUT ops/sec 6399656.3
Total ops 1572864000 Avg nsec/op 156.3 gettimeofday usec 245848989 TPUT ops/sec 6397683.4
Total ops 1572864000 Avg nsec/op 156.4 gettimeofday usec 245941009 TPUT ops/sec 6395289.7
Total ops 1572864000 Avg nsec/op 156.4 gettimeofday usec 246000176 TPUT ops/sec 6393751.5
Total ops 1572864000 Avg nsec/op 156.6 gettimeofday usec 246262366 TPUT ops/sec 6386944.2
Total ops 1572864000 Avg nsec/op 156.5 gettimeofday usec 246221624 TPUT ops/sec 6388001.1
Total ops 1572864000 Avg nsec/op 156.7 gettimeofday usec 246402465 TPUT ops/sec 6383312.8
Total ops 1572864000 Avg nsec/op 156.8 gettimeofday usec 246594031 TPUT ops/sec 6378353.9

SUMA baselines at 71.5ns average write operation and tops out at about 156ns with 8 concurrent threads of SLB execution (one per core). Let’s see what SLB on NUMA does.

SLB on NUMA
First, let’s get an idea what the memory layout is like:

$ uname -a
Linux tmr6s14 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ numactl –hardware
available: 4 nodes (0-3)
node 0 size: 8191 MB
node 0 free: 5526 MB
node 1 size: 8191 MB
node 1 free: 6973 MB
node 2 size: 8191 MB
node 2 free: 7841 MB
node 3 size: 8191 MB
node 3 free: 7707 MB

OK, this means that there is approximately 5.5GB, 6.9GB, 7.8GB and 7.7GB of free memory on “nodes” 0, 1, 2 and 3 respectively. Why is the first node (node 0) lop-sided? I’ll tell you in the next blog entry. Let’s run some SLB. First, I’ll use numactl(8) to invoke memhammer with the directive that forces allocation of memory on a node-local basis. The first test is one memhammer process per socket:

$ cat ./membind_example.4
./create_sem
numactl –membind 3 –cpubind 3 ./memhammer 262144 6000 &
numactl –membind 2 –cpubind 2 ./memhammer 262144 6000 &
numactl –membind 1 –cpubind 1 ./memhammer 262144 6000 &
numactl –membind 0 –cpubind 0 ./memhammer 262144 6000 &
./trigger
wait

$ bash ./membind_example.4
Total ops 1572864000 Avg nsec/op 67.5 gettimeofday usec 106113673 TPUT ops/sec 14822444.2
Total ops 1572864000 Avg nsec/op 67.6 gettimeofday usec 106332351 TPUT ops/sec 14791961.1
Total ops 1572864000 Avg nsec/op 68.4 gettimeofday usec 107661537 TPUT ops/sec 14609340.0
Total ops 1572864000 Avg nsec/op 69.7 gettimeofday usec 109591100 TPUT ops/sec 14352114.4

This test is the same as the one above called “4 threads, 4 sockets” performed on the SOMA configuration where the latencies were 78ns. Switching from SOMA to NUMA and executing with NUMA placement brought the latencies down 13% to an average of 68ns. Interesting. Moreover, this test with 4 concurrent memhammer processes actually demonstrates better latencies than the single process average on SUMA which was 72ns. That comparison alone is quite interesting because it makes the point quite clear that SUMA in a 4-socket system is a 75% remote memory configuration—even for a single process like memhammer.

The next test was 2 memhammer processes per socket:

$ more membind_example.8
./create_sem
numactl –membind 3 –cpubind 3 ./memhammer 262144 6000 &
numactl –membind 3 –cpubind 3 ./memhammer 262144 6000 &
numactl –membind 2 –cpubind 2 ./memhammer 262144 6000 &
numactl –membind 2 –cpubind 2 ./memhammer 262144 6000 &
numactl –membind 1 –cpubind 1 ./memhammer 262144 6000 &
numactl –membind 1 –cpubind 1 ./memhammer 262144 6000 &
numactl –membind 0 –cpubind 0 ./memhammer 262144 6000 &
numactl –membind 0 –cpubind 0 ./memhammer 262144 6000 &
./trigger
wait

$ sh ./membind_example.8
Total ops 1572864000 Avg nsec/op 95.8 gettimeofday usec 150674658 TPUT ops/sec 10438809.2
Total ops 1572864000 Avg nsec/op 96.5 gettimeofday usec 151843720 TPUT ops/sec 10358439.6
Total ops 1572864000 Avg nsec/op 96.9 gettimeofday usec 152368004 TPUT ops/sec 10322797.2
Total ops 1572864000 Avg nsec/op 96.9 gettimeofday usec 152433799 TPUT ops/sec 10318341.5
Total ops 1572864000 Avg nsec/op 96.9 gettimeofday usec 152436721 TPUT ops/sec 10318143.7
Total ops 1572864000 Avg nsec/op 97.0 gettimeofday usec 152635902 TPUT ops/sec 10304679.2
Total ops 1572864000 Avg nsec/op 97.2 gettimeofday usec 152819686 TPUT ops/sec 10292286.6
Total ops 1572864000 Avg nsec/op 97.6 gettimeofday usec 153494359 TPUT ops/sec 10247047.6

What’s that? Writing memory on the SUMA configuration in the 8 concurrent memhammer case demonstrated latencies on order of 156ns but dropped 38% to 97ns by switching to NUMA and using the Linux 2.6 NUMA API. No, of course an Oracle workload is not all random writes, but a system has to be able to handle the difficult aspects of a workload in order to offer good throughput. I won’t ask the rhetorical question of why Oracle is not NUMA aware in the x86_64 Linux ports until my next blog entry where the measurements will not be based on the SLB, but a real Oracle instance instead.

Déjà vu
Hold it. Didn’t the Dell PS1900 with a Clovertown Xeon quad-core E5320’s exhibit ~500ns latencies with only 4 concurrent threads of SLB execution (1 per core)? That was what was shown in this blog entry. Interesting.

I hope it is becoming clear why NUMA awareness is interesting. NUMA systems offer a great deal of potential incremental bandwidth when local memory is preferred over remote memory.

Next up—comparisons of SUMA versus NUMA with the Jonathan Lewis Computing Index and why all is not lost just because the 10gR2 x86_64 Linux port is not NUMA aware.

Using Linux sched_setaffinity(2) To Bind Oracle Processes To CPUs

I have been exploring the effect of process migration between CPUs in a multi-core Linux system while running long duration Oracle jobs. While Linux does schedule processes as best as possible for L2 cache affinity, I do see migrations on my HP DL 585 Opteron 850 box. Cache affinity is important, and routine migrations can slow down long running jobs. In fact, when a process gets scheduled to run on a CPU different than the one it last ran on the CPU will stall immediately while the cache is loaded with the process’ page tables—regardless of cache warmth. That is, the cache might have pages of text, data, stack and shared memory, but it won’t have the right versions of the page tables. Bear in mind that we are talking really small stalls here, but on long running jobs it can add up.

CPU_BIND
This Linux Journal webpage has the source for a program called cpu_bind that uses the Linux 2.6 sched_setaffinity(2) library routine to establish hard-affinity for a process to a specified CPU. I’ll be covering more of this in my NUMA series, but I thought I’d make a quick blog entry about this new to get the ball rolling.

After downloading the cpu_bind.c program, it is simple to compile and execute. The following session shows compilation and execution to set the PID of my current bash(1) shell to execute with hard affinity on CPU 3:

$ cc -o cpu_bind cpu_bind.c
$ cpu_bind $$ 3
$ while true
> do
> :
> done

The following is a screen shot of top(1) with CPU 3 utilized 100% in user mode by my looping shell. Note, you may have to ricght-click->vew image:

top1

If you wanted to experiment with Oracle, you could start a long running job and execute cp_bind on its PID once it is running, or do what I did with $$ and then invoke sqlplus for instance. Also, a SQL*Net listener process could be started with hard affinity to a certain CPU and you could connect to it when running a long CPU-bound job. Just a thought, but I’ll be showing real numbers in my NUMA series soon.

Give it a thought, see what you think.

The NUMA series links are:

Oracle on Opteron with Linux–The NUMA Angle (Part I)

Oracle on Opteron with Linux-The NUMA Angle (Part II)

Oracle on Opteron with Linux-The NUMA Angle (Part II)

A little more groundwork. Trust me, the Linux NUMA API discussion that is about to begin and the microbenchmark and Oracle benchmark tests will make a lot more sense with all this old boring stuff behind you.

Another Terminology Reminder
When discussing NUMA, the term node is not the same as in clusters. Remember that all the memory from all the nodes (or Quads, QBBs, RADs, etc) appear to all the processors as cache-coherent main memory.

More About NUMA Aware Software
As I mentioned in Oracle on Opteron with Linux–The NUMA Angle (Part I), NUMA awareness is a software term that refers to kernel and user mode software that makes intelligent decisions about how to best utilize resources in a NUMA system. I use the generic term resources because as I’ve pointed out, there is more to NUMA than just the non-uniform memory aspect. Yes, the acronym is Non Uniform Memory Access, but the architecture actually supports the notion of having building blocks with only processors and cache, only memory, or only I/O adaptors. It may sound really weird, but it is conceivable that a very specialized storage subsystem could be built and incorporated into a NUMA system by presenting itself as memory. Or, on the other hand, one could envision a very specialized memory component—no processors, just memory—that could be built into a NUMA system. For instance, think of a really large NVRAM device that presents itself as main memory in a NUMA system. That’s much different than an NVRAM card stuffed into something like a PCI bus and accessed with a device driver. Wouldn’t that be a great place to put an in-memory database for instance? Even a system crash would leave the contents in memory. Dealing with such topology requires the kernel to be aware of the differing memory topology that lies beneath it, and a robust user mode API so applications can allocate memory properly (you can’t just blindly malloc(3) yourself into that sort of thing). But alas, I digress since there is no such system commercially available. My intent was merely to expound on the architecture a bit in order to make the discussion of NUMA awareness more interesting.

In retrospect, these advanced NUMA topics are the reason I think Digital’s moniker for the building blocks used in the AlphaServer GS product line was the most appropriate. They used the acronym RAD (Resource Affinity Domain) which opens up the possible list of ingredients greatly. An API call would return RAD characteristics such as how many processors, how much memory (if any) and so on a RAD consisted of. Great stuff. I wonder how that compares to the Linux NUMA API? Hmm, I guess I better get to blogging…

When it comes to the current state of “commodity NUMA” (e.g., Opteron and Itanium) there are no such exotic concepts. Basically, these systems have processors and memory “nodes” with varying latency due to locality—but I/O is equally costly for all processors. I’ll speak mostly of Opteron NUMA with Linux since that is what I deal with the most and that is where I have Oracle running.

For the really bored, here is a link to a AlphaServer GS320 diagram.

The following is a diagram of the Sequent NUMA-Q components that interfaced with the SHV Xeon chipset to make systems with up to 64 processors:

lynx1.jpg

OK, I promise, the next NUMA blog entry will get into the Linux NUMA API and what it means to Oracle.

Oracle on Opteron with Linux–The NUMA Angle (Part I)

There are Horrible Definitions of NUMA Out There on the Web
I want to start blogging about NUMA with regard to Oracle because NUMA has reached the commodity hardware scene with Opteron and Hypertransport technology Yes, I know Opteron has been available for a long time, but it wasn’t until the Linux 2.6 Kernel that there were legitimate claims of the OS being NUMA-aware. Before I can start blogging about NUMA/Oracle on Opteron related topics, I need to lay down some groundwork.

First, I’ll just come out and say it, I know NUMA—really, really well. I spent the latter half of the 1990’s inside the Sequent Port of Oracle working out NUMA-optimizations to exploit Sequent NUMA-Q 2000—the first commercially available NUMA system. Yes, Data General, SGI and Digital were soon to follow with AViiON, Origin 2000 and the AlphaServer GS320 respectively. The first port of Oracle to have code within the kernel specifically exploiting NUMA architecture was the Sequent port of Oracle8i.

 

Glossary
I’d like to offer a couple of quick definitions. The only NUMA that matters where Oracle is concerned is Cache Coherent NUMA (a.k.a CC-NUMA):

NUMA – A microprocessor-based computer system architecture comprised of compute nodes that possess processors and memory and usually disk/network I/O cards. A CC-NUMA system has specialized hardware that presents all the varying memory components as a single memory image to the processors. This has historically been accomplished with crossbar, switch or SCI ring technologies. In the case of Opteron, NUMA is built into the processor since each processor has an on-die memory controller. Understanding how a memory reference is satisfied in a NUMA system is the most important aspect of understanding NUMA. Each memory address referenced by the processors in a NUMA system is essentially “snooped” by the “NUMA memory controller” which in turn determines if the memory is local to the processor or remote. If remote, the NUMA “engine” must perform a fetch of the memory and install it into the requesting processor cache (which cache depends on the implementation although most have historically implemented an L3 cache for this remote-memory “staging”). The NUMA “engine” has to be keenly tuned to the processor’s capabilities since all memory related operations have to be supported including cache line invalidations and so forth. Implementations have varied wildly since the early 1990s. There have been NUMA systems that were comprised of complete systems linked by a NUMA engine. One such example was the Sequent NUMA-Q 2000 which was built on commodity Intel-based Pentium systems “chained” together by a very specialized piece of hardware that attached directly to each system bus. That specialized hardware was the called the Lynx Card which had an OBIC (Orion Bus Interface Controller) and a SCLIC (SCI Line Interface Controller) as well as 128MB L3 remote cache. On the Lynx card was a 510-pin GaAs ASIC that served as the “data pump” of the NUMA “engine”. These commodity NUMA “building blocks” were called “Quads” because they had 4 processors, local memory, local network and disk I/O adaptors—a lot of them. Digital referred to their physical building blocks as QBB (Quad Building Blocks) and logically (in their API for instance) as“RAD”s for Resource Affinity Domains. In the case of Opteron, each processor is considered a “node” with only CPU and memory locality. With Opteron, network and disk I/O are uniform.

NUMA Aware – This term applies to software. NUMA-aware software is optimized for NUMA such that the topology is understood and runtime decisions can be made such as what segment of memory to allocate from or what adaptor to perform I/O through. The latter, of course, not applying to Opteron. NUMA awareness starts in the kernel and with a NUMA API, applications too can be made NUMA aware. The Linux 2.6 Kernel had NUMA awareness built into the kernel—to a certain extent and there has been a NUMA API available for just as long. Is the Kernel fully NUMA-optimized? Not by any stretch of the imagination. Is the API complete? No. Does that mean the Linux NUMA-related technology is worthless? That is what I intend to blog about.

Some of the good engineers that build NUMA-awareness into the Sequent NUMA-Q operating system—DYNIX/ptx—have contributed NUMA awareness to Linux through their work in the IBM Linux Technology Center. That is a good thing.

This thread on Opteron and Linux NUMA is going to be very Oracle-centric and will come out as a series of installments. But first, a trip down memory lane.

The NUMA Stink
In the year 2000, Sun was finishing a very anti-NUMA campaign. I remember vividly the job interview I had with Sun’s Performance, Availability and Architecture Engineering (PAE) Group lead by Ganesh Ramamurthy. Those were really good guys, I enjoyed the interview and I think I even regretted turning down their offer so I could instead work in the Veritas Database Editions Group on the Oracle Disk Manager Library. One of the prevailing themes during that interview was how hush, hush, wink, wink they were about using the term NUMA to describe forthcoming systems such as StarCat. That attitude even showed in the following Business Review Online article where the VP Enterprise Systems at Sun in that time frame stated:

“We don’t think of the StarCat as a NUMA or COMA server,” he said. “This server has SMP latencies, and it is just a bigger, badder Starfire.”

No, it most certainly isn’t a COMA (although it did implement a few of the aspects of COMA) and it most certainly has always been a NUMA. Oops, I forgot to define COMA…next entry…and, oh, Opteron has made saying NUMA cool again!

 

AMD Quad-core “Barcelona” Processor For Oracle (Part II)

I am a huge AMD fan, but I am now giving up my hopes of finding any substantial information that could be used to predict what Oracle performance might be like on next year’s Barcelona (a.k.a. K8L) quad-core processor. I did, however, find another ” interesting blog” while trolling for information on this topic. Note, the quotes! Folks, NOTE THE QUOTES!!! I’m insinuating something there…

Lowered Expectations?
Anyway, what I am finding is that by AMD’s own predictions, we should expect Barcelona to outperform Intel’s Clovertown (Xeon 5355) processor by about 15% or so. The problem is that there really are no real numbers. You can view this AMD video about Barcelona. In it you’ll find a slide that shows their estimated 70% OLTP improvement over the Opteron 2200 SE product. The 2200 is a Socket F processor and luckily for us there is an audited TPC-C result of 34,923 TpmC/core. Note, I’m boiling down TPC results by core to make some sense of this. The Barcelona processor is 100% compatible with the Socket F family. I find it hard to imagine that Barcelona will be able to squeeze out a 70% performance increase from the same chipset. Oh well. But if it did, that would be a TPC-C result of 59,369 per core. So why then is that AMD video so focused on leap-frogging the Xeon 5355 which “only” gets 30,092 TpmC/core? And why the fixation on the Xeon 5355 when the Xeon 7140 “Tulsa” achieves 39,800 TpmC/core? It was nice and convenient to be able to compare the 2200SE, 5355 and 7140 with TPC results based on the same database—SQL Server.

I also see no evidence of IBM, HP or Dell planning to base a server on Barcelona. That’s scary. I’m expecting some quasi-inside information from Sun. Let’s see if that will help any of this make sense.

The following is shot of the AMD slide predicting 70% performance over the Xeon 5160 and Opteron 2200SE (which as I point out is a bit moot). You may have to right-click and view to zoom in on it:

AMD-Barcelona2

OLTP is Old News
Finally, I’m discovering that you don’t get much information about processors when searching for that old, boring OLTP stuff. If I search for “megatasking +AMD” on the other hand—now that produces a richness of information! I’ve also learned that “enthusiast” is a buzzword AMD and Intel are both beating on heavily. I was completely unaware that there is actually what is known as an “enthusiast market”. It seems customers in this particular market buy processors that also wind up in servers for OLTP. I just hope the processors they are making for “enthusiasts” are also reasonably fit for Oracle databases. I’m afraid we aren’t going to know until we find out.

In the meantime, I think I’ll push some megatasking tests through my cluster of DL585s.


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 2,938 other followers

Oracle ACE Program Status

Click It

website metrics

Fond Memories

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.

%d bloggers like this: