Archive for the 'Nehalem' Category

Linux Thinks It’s a CPU, But What Is It Really – Part III. How Do Intel Xeon 7500 (Nehalem EX) Processors Map To Linux OS Processors?

Last year I posted a blog entry entitled Linux Thinks It’s a CPU, But What Is It Really – Part I. Mapping Xeon 5500 (Nehalem) Processor Threads to Linux OS CPUs where I discussed the Intel CPU Topology Tool. The topology tool is most helpful when trying to quickly map Linux OS processors to physical processor cores or threads. That post gets read, on average, close to 20 times per day since it was posted (10,000+views) so I thought it deserves a follow-up pertaining to more recent Intel processors and, more importantly, more recent Linux releases.

I’m happy to point out that the tool still functions just fine for Intel Xeon 7500 series processors (a.k.a. Nehalem EX see also Sun Oracle’s Sun Fire X4800), however, with recent Linux releases the tool is not quite as necessary. With both Enterprise Linux Enterprise Linux Server release 5.5 (Oracle Enterprise Linux 5.5 ) and Red Hat Enterprise Linux Server release 5.5 the numactl(8) command now renders output that makes it quite clear which sockets associate with which OS processors.

The following output was captured from an 8-socket Nehalem EX machine:

$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 131062 MB
node 0 free: 122879 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 131072 MB
node 1 free: 125546 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 131072 MB
node 2 free: 125312 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 131072 MB
node 3 free: 126543 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 131072 MB
node 4 free: 125454 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 131072 MB
node 5 free: 124881 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 131072 MB
node 6 free: 123862 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 131072 MB
node 7 free: 126054 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  15  20  15  15  20  20  20 
  1:  15  10  15  20  20  15  20  20 
  2:  20  15  10  15  20  20  15  20 
  3:  15  20  15  10  20  20  20  15 
  4:  15  20  20  20  10  15  15  20 
  5:  20  15  20  20  15  10  20  15 
  6:  20  20  15  20  15  20  10  15 
  7:  20  20  20  15  20  15  15  10 

A node is synonymous with a socket in this case. So, as the output shows, socket 0 maps to OS processors 0-7 and 64-71, the latter range being processor threads. Let’s see how similar this output is to the Intel CPU Topology Tool (NOTE – hover over the box and click view source for best presentation):

Package 0 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
L1D is Level 1 Data cache, size(KBytes)= 32,  Cores/cache= 2, Caches/package= 8
L1I is Level 1 Instruction cache, size(KBytes)= 32,  Cores/cache= 2, Caches/package= 8
L2 is Level 2 Unified cache, size(KBytes)= 256,  Cores/cache= 2, Caches/package= 8
L3 is Level 3 Unified cache, size(KBytes)= 24576,  Cores/cache= 16, Caches/package= 1
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|       0       64|       1       65|       2       66|       3       67|       4       68|       5       69|       6       70|       7       71|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|       1     1z16|       2     2z16|       4     4z16|       8     8z16|      10     1z17|      20     2z17|      40     4z17|      80     8z17|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ff


Package 1 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|       8       72|       9       73|      10       74|      11       75|      12       76|      13       77|      14       78|      15       79|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     100     1z18|     200     2z18|     400     4z18|     800     8z18|     1z3     1z19|     2z3     2z19|     4z3     4z19|     8z3     8z19|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ff00


Package 2 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      16       80|      17       81|      18       82|      19       83|      20       84|      21       85|      22       86|      23       87|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     1z4     1z20|     2z4     2z20|     4z4     4z20|     8z4     8z20|     1z5     1z21|     2z5     2z21|     4z5     4z21|     8z5     8z21|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz4


Package 3 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      24       88|      25       89|      26       90|      27       91|      28       92|      29       93|      30       94|      31       95|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     1z6     1z22|     2z6     2z22|     4z6     4z22|     8z6     8z22|     1z7     1z23|     2z7     2z23|     4z7     4z23|     8z7     8z23|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz6


Package 4 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      32       96|      33       97|      34       98|      35       99|      36      100|      37      101|      38      102|      39      103|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     1z8     1z24|     2z8     2z24|     4z8     4z24|     8z8     8z24|     1z9     1z25|     2z9     2z25|     4z9     4z25|     8z9     8z25|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz8


Package 5 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      40      104|      41      105|      42      106|      43      107|      44      108|      45      109|      46      110|      47      111|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|    1z10     1z26|    2z10     2z26|    4z10     4z26|    8z10     8z26|    1z11     1z27|    2z11     2z27|    4z11     4z27|    8z11     8z27|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz10


Package 6 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      48      112|      49      113|      50      114|      51      115|      52      116|      53      117|      54      118|      55      119|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|    1z12     1z28|    2z12     2z28|    4z12     4z28|    8z12     8z28|    1z13     1z29|    2z13     2z29|    4z13     4z29|    8z13     8z29|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz12


Package 7 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      56      120|      57      121|      58      122|      59      123|      60      124|      61      125|      62      126|      63      127|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|    1z14     1z30|    2z14     2z30|    4z14     4z30|    8z14     8z30|    1z15     1z31|    2z15     2z31|    4z15     4z31|    8z15     8z31|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

I’m quite happy to see this enhancement to numactl(8). I’ll try to blog soon on why you should care about this topic.

You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part I.

In May 2009 I made a blog entry entitled You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part II. There had not yet been a Part I but as I pointed out in that post I would loop back and make Part I. Here it is. Better late than never.

Background
I originally planned to use Part I to stroll down memory lane (back to 1995) with a story about the then VP of Oracle RDBMS Development’s initial impression about the Sequent DYNIX/ptx NUMA API during a session where we presented it and how it would be beneficial to code to NUMA APIs sooner rather than later. We were mixing vision with the specific need of our port to be honest.

We were the first to have a production NUMA API to which Oracle could port and we were quite a bit sooner to the whole NUMA trend than anyone else. Our’s was the first production NUMA system.

Now, this VP is no longer at Oracle but the  (redacted) response was, “Why would we want to use any of this ^#$%.”  We (me and the three others presenting the API) were caught off guard. However, we all knew that the question was a really good question. There were still good companies making really tight, high-end SMPs with uniform memory.  Just because we (Sequent) had to move into NUMA architecture didn’t mean we were blind to the reality around us. However, one thing we knew for sure—all systems in the future would have NUMA attributes of varying levels. All our competition was either in varying stages of denial or doing what I like to refer to as “Poo-pooh it while you do it.” All the major players eventually came out with NUMA systems.  Some sooner, some later and the others died trying.

That takes us to Commodity NUMA and the new purpose of this “Part I” post.

Before I say a word about this Part I I’d like to point out that the concepts in Part II are of a “must-know” variety unless you relinquish your computing power to some sort of hosted facility where you don’t have the luxury of caring about the architecture upon which you run Oracle Database.

Part II was about the different types of NUMA (historical and present) and such knowledge will help you if you find yourself in a troubling performance situation that relates to NUMA. NUMA is commodity, as I point out, and we have to come to grips with that.

What Is He Blogging About?
The current state of commodity NUMA is very peculiar. These Commodity NUMA Implementations (CNI) systems are so tightly coupled that most folks don’t even realize they are running on a NUMA system. In fact, let me go out on a ledge. I assert that nobody is configuring Oracle Database 11g Release 2 with NUMA optimizations in spite of the fact that they are on a NUMA box (e.g., Nehalem EP, AMD Opterton). The reason I believe this is because the init.ora parameter to invoke Oracle NUMA awareness changed names from 11gR1 to 11gR2 as per My Oracle Support note 864633.1. The parameter changed from _enable_NUMA_optimization to enable_NUMA_support. I know nobody is setting this because if they had I can almost guarantee they would have googled for problems. Allow me to explain.

If Nobody is Googling It, Nobody is Doing It
Anyone who tests _enable_NUMA_support as per My Oracle Support note 864633.1 will likely experience the sorts of problems that I detail later in this post. But first, let’s see what they would get from google when they search for _enable_NUMA_support:

Yes, just as I thought…Google found nothing. But what is my point? My point is two-fold. First, I happen to know that Nehalem EP  with QPI and Opteron with AMD HyperTransport are such good technologies that you really don’t have to care that much about NUMA software optimizations. At least to this point of the game. Reading M.O.S note 1053332.1 (regards disabling Linux NUMA support for Oracle Database Machine hosts) sort of drives that point home. However, saying you don’t need to care about NUMA doesn’t mean you shouldn’t experiment. How can anyone say that setting _enable_NUMA_support is a total placebo in all cases? One can’t prove a negative.

If you dare, trust me when I say that an understanding of NUMA will be as essential in the next 10 years as understanding SMP (parallelism and concurrency) was in the last 20 years. OK, off my soapbox.

Some Lessons in Enabling Oracle NUMA Optimizations with Oracle Database 11g Release 2
This section of the blog aims to point out that even when you think you might have tested Oracle NUMA optimizations there is a chance you didn’t. You have to know the way to ensure you have NUMA optimizations in play. Why? Well, if the configuration is not right for enabling NUMA features, Oracle Database will simply ignore you. Consider the following session where I demonstrate the following:

  1. Evidence that I am on a NUMA system (numactl(8))
  2. I started up an instance with a pfile (p4.ora) that has _enable_NUMA_support set to TRUE
  3. The instance started but _enable_NUMA_support was forced back to FALSE

Note, in spite of event #3, the alert log will not report anything to you about what went wrong.

SQL>
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 36317 MB
node 0 free: 31761 MB
node 1 size: 36360 MB
node 1 free: 35425 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

SQL> startup pfile=./p4.ora
ORACLE instance started.

Total System Global Area 5746786304 bytes
Fixed Size                  2213216 bytes
Variable Size            1207962272 bytes
Database Buffers         4294967296 bytes
Redo Buffers              241643520 bytes
Database mounted.
Database opened.
SQL> show parameter _enable_NUMA_support

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_enable_NUMA_support                 boolean     FALSE

SQL>
SQL> !grep _enable_NUMA_support ./p4.ora
_enable_NUMA_support=TRUE

OK, so the instance is up and the parameter was reverted, what does the IPC shared memory segment look like?

SQL> !ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0xed304ac0 229380     oracle    660        4096       0
0x7393f7f4 1179653    oracle    660        5773459456 35
0x00000000 393223     oracle    644        790528     5          dest
0x00000000 425992     oracle    644        790528     5          dest
0x00000000 458761     oracle    644        790528     5          dest

Right, so I have no NUMA placement of the buffer pool. On Linux, Oracle must create multiple segments and allocate them on specific NUMA nodes (memory hierarchies). It was a little simpler for the first NUMA-aware port of Oracle (Sequent) since the APIs allowed for the creation of a single shared memory segment with regions of the segment placed onto different memories. Ho Hum.

What Went Wrong
Oracle could not find the libnuma.so it wanted to link with dlopen():

$ grep libnuma /tmp/strace.out | grep ENOENT | head
14626 open("/usr/lib64/libnuma.so", O_RDONLY) = -1 ENOENT (No such file or directory)
14627 open("/usr/lib64/libnuma.so", O_RDONLY) = -1 ENOENT (No such file or directory)

So I create the necessary symbolic link and subsequently boot the instance and inspect the shared memory segments. Here I see that I have a ~1GB segment for the variable SGA components and my buffer pool has been segmented into two roughly 2.3 GB segments.

# ls -l /usr/*64*/*numa*
lrwxrwxrwx 1 root root    23 Mar 17 09:25 /usr/lib64/libnuma.so -> /usr/lib64/libnuma.so.1
-rwxr-xr-x 1 root root 21752 Jul  7  2009 /usr/lib64/libnuma.so.1

SQL> show parameter db_cache_size

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_cache_size                        big integer 4G
SQL> show parameter NUMA_support

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_enable_NUMA_support                 boolean     TRUE
SQL> !ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0xed304ac0 229380     oracle    660        4096       0
0x00000000 2719749    oracle    660        1006632960 35
0x00000000 2752518    oracle    660        2483027968 35
0x00000000 393223     oracle    644        790528     6          dest
0x00000000 425992     oracle    644        790528     6          dest
0x00000000 458761     oracle    644        790528     6          dest
0x00000000 2785290    oracle    660        2281701376 35
0x7393f7f4 2818059    oracle    660        2097152    35

So there I have an SGA successfully created with _enable_NUMA_support set to TRUE. But, what strings appear in the alert log? Well, I’ll blog that soon because it leads me to other content.


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 3,010 other followers

Oracle ACE Program Status

Click It

website metrics

Fond Memories

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.

%d bloggers like this: