Archive for the 'Nehalem EX' Category

Exadata Database Machine X2-2 or X2-8? Sure! Why Not? Part II.

In my recent post entitled Exadata Database Machine X2-2 or X2-8? Sure! Why Not? Part I, I started to address the many questions folks are sending my way about what factors to consider when choosing between the Exadata Database Machine X2-8 and the Exadata Database Machine X2-2. This post continues that thread.

As my friend Greg Rahn points out in his recent post about Exadata, the latest Exadata Storage Server is based on Intel Xeon 5600 (Westmere EP) processors. The Exadata Storage Server is the same whether the database grid is X2-2 or X2-8. The X2-2 database hosts are also based on Intel Xeon 5600. On the other hand, the X2-8 database hosts are based on Intel Xeon 7500 (Nehalem EX). This is a relevant distinction when thinking about database encryption.

Transparent Data Encryption

In his recent post, Greg brings up the topic of Oracle Database Transparent Data Encryption (TDE). As Greg points out, the new Exadata Storage Server software is able to leverage Intel Advanced Encryption Standard New Instructions (Intel AES-NI) via the Intel Integrated Performance Primitives (Intel IPP) library because the processors in the storage servers are Intel Xeon 5600 (Westmere EP). Think of this as “hardware-assist.” However, in the case of the database hosts in the X2-8, there is no hardware-assist for TDE because Nehalem EX does not support the necessary instructions. Westmere EX will, someday. So what does this mean?

TDE and Compression? Unlikely Cousins?

At first glance one would think there is nothing in common between TDE and compression. However, in an Exadata environment there is storage offload processing, and for that reason it is important to understand roles. That is, knowing what gets done is sometimes not as important as knowing who is doing what.

When I speak to people about Exadata I tend to draw the mental picture of an “upper” and “lower” half. While the count of servers in each grid is not split 50/50 by any means, thinking about Exadata in this manner makes understanding certain features a lot simpler. Allow me to explain.

Compression

In the case of compressing data, all work is done by the upper half (the database grid). On the other hand, decompression effort takes place in either the upper or lower half depending on certain criteria.

  • Upper Half Compression. Always.
  • Lower Half Compression. Never.
  • Lower Half Decompression. Data compressed with Hybrid Columnar Compression (HCC) is decompressed in the Exadata Storage Servers when accessed via Smart Scan. Visit my post about what triggers a Smart Scan for more information.
  • Upper Half Decompression. With all compression types, other than HCC, decompression effort takes place in the upper half. When accessed without Smart Scan, HCC data is also decompressed in the upper half.
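
For context, HCC is declared at the table (or partition) level. The following is a minimal sketch of the 11.2 syntax; the sales table is just a placeholder:

-- Hedged sketch: create an HCC-compressed copy of a (placeholder) table.
-- COMPRESS FOR QUERY LOW|HIGH and ARCHIVE LOW|HIGH are the HCC levels.
CREATE TABLE sales_hcc
  COMPRESS FOR QUERY HIGH
  AS SELECT * FROM sales;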

Encryption

In the case of encryption, the upper/lower half breakout is as follows:

  • Upper Half Encryption. Always. Data is always encrypted by code executing in the database grid. If the processors are Intel Xeon 5600 (Westmere EP), as is the case with the X2-2, there is hardware assist via the IPP library. The X2-8 is built on Nehalem EX and therefore does not offer hardware-assist encryption (a quick check for these instructions follows this list).
  • Lower Half Encryption. Never.
  • Lower Half Decryption. Smart Scan only. If data is not being accessed via Smart Scan, the blocks are returned to the database host and buffered in the SGA (see the Seven Fundamentals). Both the X2-2 and X2-8 are attached to Westmere EP-based storage servers. To that end, both of these configurations benefit from hardware-assist decryption via the IPP library. I reiterate, however, that this hardware-assist lower-half decryption occurs only during Smart Scan.
  • Upper Half Decryption. Always in the case of data accessed without Smart Scan. In the case of X2-2, this upper-half decryption benefits from hardware-assist via the IPP library.
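
As an aside, it is easy to check whether a given host's processors expose the AES instructions at all. Westmere-generation CPUs advertise the aes CPU flag and Nehalem-generation CPUs do not:

$ grep -m1 -ow aes /proc/cpuinfo || echo "no AES-NI on this host"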

That pretty much covers it, and now we see the commonality between compression and encryption: it is mostly a matter of whether or not a query is being serviced via Smart Scan.
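
If you want to verify which half did the work for a given query, one hedged way is to check the session-level cell offload statistics. A minimal sketch; the statistic names below are those visible in 11.2, so verify them against v$statname on your release:

-- Run this in the same session that executed the query of interest.
-- Nonzero "smart scan" bytes indicate the lower half serviced the scan.
SELECT sn.name, ms.value
  FROM v$mystat ms, v$statname sn
 WHERE ms.statistic# = sn.statistic#
   AND sn.name IN
       ('cell physical IO bytes eligible for predicate offload',
        'cell physical IO interconnect bytes returned by smart scan');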

That’s Not All

If HCC data is also stored in encrypted form, a Smart Scan is able to filter out vast amounts of encrypted data without even touching it. That is, HCC short-circuits a lot of decryption cost. And, even though Exadata is really fast, it is always faster to not do something at all than to shift into high gear and do it as fast as possible.

Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn’t It? Part – I.

Allocating hugepages for Oracle Database on Linux can be tricky. The following is a short list of some of the common problems associated with faulty attempts to get things properly configured:

  1. Insufficient Hugepages. You can be short just a single 2MB hugepage at instance startup and Oracle will silently fall back to no hugepages. For instance, if an instance needs 10,000 hugepages but there are only 9,999 available at startup, Oracle will create non-hugepages IPC shared memory and the 9,999 (x 2MB) is just wasted memory.
    1. Insufficient hugepages is an even more difficult situation when booting with _enable_NUMA_support=TRUE as partial hugepages backing is possible.
  2. Improper Permissions. Both limits.conf(5) memlock and the shell ulimit -l must accommodate the desired amount of locked memory (a minimal configuration sketch follows this list).
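
Item 2 is mechanical once you know where to look. The following is a minimal sketch, assuming the oracle user and an arbitrary 12GB locked-memory allowance; size memlock at least as large as your intended hugepages allocation:

# /etc/security/limits.conf entries -- values are in KB.
# 12582912 KB (12GB) is an arbitrary figure for this example; it must
# cover the hugepages the instance will attempt to lock.
oracle soft memlock 12582912
oracle hard memlock 12582912

Then verify with ulimit -l from a fresh login as the oracle user; it should report 12582912.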

In general, list item 1 above has historically been the most difficult to deal with, especially on systems hosting several instances of Oracle. Since there is no way to determine whether an existing segment of shared memory is backed with hugepages, diagnostics are in short supply.

Oracle Database 11g Release 2 (11.2.0.2)

The fix for Oracle bugs 9195408 (unpublished) and 9931916 (published) is available in 11.2.0.2. In a sort of fast forward to the past, the Linux port now supports an initialization parameter to force the instance to use hugepages for all segments or fail to boot. I recall initialization parameters on Unix ports back in the early 1990s that did just that. The initialization parameter is called use_large_pages, and setting it to “only” results in the all-or-none scenario. This, by the way, addresses list item 1.1 above. That is, setting use_large_pages=only ensures an instance will not have some NUMA segments backed with hugepages and others without.

Consider the following example. Here we see that use_large_pages is set to “only” and yet the system has only a very small number of hugepages allocated (800 == ~1.6GB). First I’ll boot the instance using an init.ora file that does not force hugepages and then move on to the one that does. Note, this is 11.2.0.2.

$ sqlplus '/ as sysdba'

SQL*Plus: Release 11.2.0.2.0 Production on Tue Sep 28 08:10:36 2010

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to an idle instance.

SQL>
SQL> !grep -i huge /proc/meminfo
HugePages_Total:   800
HugePages_Free:    800
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
SQL>
SQL> !grep large_pages y.ora x.ora
use_large_pages=only
SQL>
SQL> startup force pfile=./x.ora
ORACLE instance started.

Total System Global Area 4.4363E+10 bytes
Fixed Size                  2242440 bytes
Variable Size            1406199928 bytes
Database Buffers         4.2950E+10 bytes
Redo Buffers                4427776 bytes
Database mounted.
Database opened.
SQL> HOST date
Tue Sep 28 08:13:23 PDT 2010

SQL>  startup force pfile=./y.ora
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory

The user feedback is a trite ORA-27102. So the question is, which memory cannot be allocated? Let’s take a look at the alert log:

Tue Sep 28 08:16:05 2010
Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 800 free: 800)
DFLT Huge Pages allocation successful (allocated: 512)
Huge Pages allocation failed (free: 288 required: 10432)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (1) (allocated: 3)
Huge Pages allocation failed (free: 285 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 285 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (1) (allocated: 192)
NUMA Huge Pages allocation on node (1) (allocated: 64)

That is good diagnostic information. It informs us that the variable portion of the SGA was successfully allocated and backed with hugepages. It just so happens that my variable SGA component is precisely sized to 1GB. That much is simple to understand. After creating the segment for the variable SGA component, Oracle moves on to create the NUMA buffer pool segments. This is a 2-socket Nehalem EP system, and Oracle allocates from the Nth NUMA node and works back to node 0. In this case the first buffer pool creation attempt was for node 1 (socket 1). However, there were insufficient hugepages, as indicated in the alert log. In the following example I allocated another arbitrarily insufficient number of hugepages and tried to start an instance with use_large_pages=only. This particular insufficient-hugepages scenario lets us see more interesting diagnostics:

SQL>  !grep -i huge /proc/meminfo
HugePages_Total: 12000
HugePages_Free:  12000
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

SQL> startup force pfile=./y.ora
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory

…and, the alert log:

Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 12000 free: 12000)
DFLT Huge Pages allocation successful (allocated: 512)
NUMA Huge Pages allocation on node (1) (allocated: 10432)
Huge Pages allocation failed (free: 1056 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 1056 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 1056 required: 5184)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (0) (allocated: 704)
NUMA Huge Pages allocation on node (0) (allocated: 320)

In this example we see that 12,000 hugepages were sufficient to back the variable SGA component and only one of the NUMA buffer pools (remember, this is Nehalem EP with the OS boot string numa=on).
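
Incidentally, much of the guesswork here comes down to sizing /proc/sys/vm/nr_hugepages from the actual segment sizes. The following is a hedged back-of-envelope sketch, not a substitute for Oracle's recommended sizing script; it presumes the instance(s) are up so the segments exist to be measured:

# Sum the Oracle-owned SysV shared memory segments (bytes are column 5
# of ipcs -m output), rounding each segment up to whole 2MB pages.
$ ipcs -m | awk '$3 == "oracle" { p += int(($5 + 2097151) / 2097152) } END { print p }'

# Then, with the instances down, allocate that many pages plus headroom:
# sysctl -w vm.nr_hugepages=<pages from above>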

Summary

In my opinion, this is a must-set parameter if you need hugepages. With initialization parameters like use_large_pages, configuring hugepages for Oracle Database is getting a lot simpler.

Next In Series

  1. “[…] if you need hugepages”
  2. More on hugepages and NUMA
  3. Any pitfalls I find.

More Hugepages Articles

Link to Part II in this series: Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn’t It? Part – II.

Link to Part III in this series: Configuring Linux Hugepages for Oracle Database is Just Too Difficult! Isn’t It? Part – III.

And more:

  • Quantifying Hugepages Memory Savings with Oracle Database 11g
  • Little Things Doth Crabby Make – Part X. Posts About Linux Hugepages Makes Some Crabby It Seems. Also, Words About Sizing Hugepages.
  • Little Things Doth Crabby Make – Part IX. Sometimes You Have To Really, Really Want Your Hugepages Support For Oracle Database 11g.
  • Little Things Doth Crabby Make – Part VIII. Hugepage Support for Oracle Database 11g Sometimes Means Using The ipcrm Command. Ugh.
  • Oracle Database 11g Automatic Memory Management – Part I.

Linux Thinks It’s a CPU, But What Is It Really – Part III. How Do Intel Xeon 7500 (Nehalem EX) Processors Map To Linux OS Processors?

Last year I posted a blog entry entitled Linux Thinks It’s a CPU, But What Is It Really – Part I. Mapping Xeon 5500 (Nehalem) Processor Threads to Linux OS CPUs, in which I discussed the Intel CPU Topology Tool. The topology tool is most helpful when trying to quickly map Linux OS processors to physical processor cores or threads. That post has been read, on average, close to 20 times per day since it went up (10,000+ views), so I thought it deserved a follow-up pertaining to more recent Intel processors and, more importantly, more recent Linux releases.

I’m happy to point out that the tool still functions just fine for Intel Xeon 7500 series processors (a.k.a. Nehalem EX; see also Oracle’s Sun Fire X4800). However, with recent Linux releases the tool is not quite as necessary. With both Oracle Enterprise Linux 5.5 and Red Hat Enterprise Linux Server release 5.5, the numactl(8) command now renders output that makes it quite clear which sockets associate with which OS processors.

The following output was captured from an 8-socket Nehalem EX machine:

$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 131062 MB
node 0 free: 122879 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 131072 MB
node 1 free: 125546 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 131072 MB
node 2 free: 125312 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 131072 MB
node 3 free: 126543 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 131072 MB
node 4 free: 125454 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 131072 MB
node 5 free: 124881 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 131072 MB
node 6 free: 123862 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 131072 MB
node 7 free: 126054 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  15  20  15  15  20  20  20 
  1:  15  10  15  20  20  15  20  20 
  2:  20  15  10  15  20  20  15  20 
  3:  15  20  15  10  20  20  20  15 
  4:  15  20  20  20  10  15  15  20 
  5:  20  15  20  20  15  10  20  15 
  6:  20  20  15  20  15  20  10  15 
  7:  20  20  20  15  20  15  15  10 

A node is synonymous with a socket in this case. So, as the output shows, socket 0 maps to OS processors 0-7 and 64-71, the latter range being processor threads. Let’s see how similar this output is to the Intel CPU Topology Tool:

Package 0 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
L1D is Level 1 Data cache, size(KBytes)= 32,  Cores/cache= 2, Caches/package= 8
L1I is Level 1 Instruction cache, size(KBytes)= 32,  Cores/cache= 2, Caches/package= 8
L2 is Level 2 Unified cache, size(KBytes)= 256,  Cores/cache= 2, Caches/package= 8
L3 is Level 3 Unified cache, size(KBytes)= 24576,  Cores/cache= 16, Caches/package= 1
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|       0       64|       1       65|       2       66|       3       67|       4       68|       5       69|       6       70|       7       71|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|       1     1z16|       2     2z16|       4     4z16|       8     8z16|      10     1z17|      20     2z17|      40     4z17|      80     8z17|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ff


Package 1 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|       8       72|       9       73|      10       74|      11       75|      12       76|      13       77|      14       78|      15       79|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     100     1z18|     200     2z18|     400     4z18|     800     8z18|     1z3     1z19|     2z3     2z19|     4z3     4z19|     8z3     8z19|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ff00


Package 2 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      16       80|      17       81|      18       82|      19       83|      20       84|      21       85|      22       86|      23       87|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     1z4     1z20|     2z4     2z20|     4z4     4z20|     8z4     8z20|     1z5     1z21|     2z5     2z21|     4z5     4z21|     8z5     8z21|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz4


Package 3 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      24       88|      25       89|      26       90|      27       91|      28       92|      29       93|      30       94|      31       95|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     1z6     1z22|     2z6     2z22|     4z6     4z22|     8z6     8z22|     1z7     1z23|     2z7     2z23|     4z7     4z23|     8z7     8z23|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz6


Package 4 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      32       96|      33       97|      34       98|      35       99|      36      100|      37      101|      38      102|      39      103|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     1z8     1z24|     2z8     2z24|     4z8     4z24|     8z8     8z24|     1z9     1z25|     2z9     2z25|     4z9     4z25|     8z9     8z25|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz8


Package 5 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      40      104|      41      105|      42      106|      43      107|      44      108|      45      109|      46      110|      47      111|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|    1z10     1z26|    2z10     2z26|    4z10     4z26|    8z10     8z26|    1z11     1z27|    2z11     2z27|    4z11     4z27|    8z11     8z27|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz10


Package 6 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      48      112|      49      113|      50      114|      51      115|      52      116|      53      117|      54      118|      55      119|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|    1z12     1z28|    2z12     2z28|    4z12     4z28|    8z12     8z28|    1z13     1z29|    2z13     2z29|    4z13     4z29|    8z13     8z29|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz12


Package 7 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      56      120|      57      121|      58      122|      59      123|      60      124|      61      125|      62      126|      63      127|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|    1z14     1z30|    2z14     2z30|    4z14     4z30|    8z14     8z30|    1z15     1z31|    2z15     2z31|    4z15     4z31|    8z15     8z31|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

I’m quite happy to see this enhancement to numactl(8). I’ll try to blog soon on why you should care about this topic.
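
In the meantime, here is one concrete, hedged example of putting the socket-to-CPU mapping to work: pinning a process and its memory to socket 0 (node 0). The program name is just a placeholder:

# Run a (placeholder) program on node 0's CPUs with its memory
# allocated from node 0:
$ numactl --cpunodebind=0 --membind=0 ./my_program

# Equivalent CPU pinning using the explicit OS processor list for
# socket 0 (cores 0-7 plus their sibling threads 64-71):
$ taskset -c 0-7,64-71 ./my_program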

You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part I.

In May 2009 I made a blog entry entitled You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part II. There had not yet been a Part I, but as I pointed out in that post, I would loop back and write Part I. Here it is. Better late than never.

Background
I originally planned to use Part I to stroll down memory lane (back to 1995) with a story about the then VP of Oracle RDBMS Development and his initial impression of the Sequent DYNIX/ptx NUMA API, formed during a session where we presented the API and argued that it would be beneficial to code to NUMA APIs sooner rather than later. To be honest, we were mixing vision with the specific needs of our port.

We were the first to have a production NUMA API to which Oracle could port, and we were quite a bit earlier to the whole NUMA trend than anyone else. Ours was the first production NUMA system.

Now, this VP is no longer at Oracle, but the (redacted) response was, “Why would we want to use any of this ^#$%.” We (me and the three others presenting the API) were caught off guard. However, we all knew it was a really good question. There were still good companies making really tight, high-end SMPs with uniform memory. Just because we (Sequent) had to move into NUMA architecture didn’t mean we were blind to the reality around us. However, one thing we knew for sure: all systems in the future would have NUMA attributes of varying degrees. All our competition was either in varying stages of denial or doing what I like to refer to as “poo-pooh it while you do it.” All the major players eventually came out with NUMA systems. Some sooner, some later, and the others died trying.

That takes us to Commodity NUMA and the new purpose of this “Part I” post.

Before I say a word about this Part I, I’d like to point out that the concepts in Part II are of a “must-know” variety, unless you relinquish your computing power to some sort of hosted facility where you don’t have the luxury of caring about the architecture upon which you run Oracle Database.

Part II was about the different types of NUMA (historical and present) and such knowledge will help you if you find yourself in a troubling performance situation that relates to NUMA. NUMA is commodity, as I point out, and we have to come to grips with that.

What Is He Blogging About?
The current state of commodity NUMA is very peculiar. These Commodity NUMA Implementation (CNI) systems are so tightly coupled that most folks don’t even realize they are running on a NUMA system. In fact, let me go out on a ledge. I assert that nobody is configuring Oracle Database 11g Release 2 with NUMA optimizations in spite of the fact that they are on a NUMA box (e.g., Nehalem EP, AMD Opteron). The reason I believe this is that the init.ora parameter that invokes Oracle NUMA awareness changed names from 11gR1 to 11gR2, as per My Oracle Support note 864633.1. The parameter changed from _enable_NUMA_optimization to _enable_NUMA_support. I know nobody is setting this because, if they had, I can almost guarantee they would have googled for problems. Allow me to explain.

If Nobody is Googling It, Nobody is Doing It
Anyone who tests _enable_NUMA_support as per My Oracle Support note 864633.1 will likely experience the sorts of problems I detail later in this post. But first, let’s see what they would get from Google when they search for _enable_NUMA_support:

Yes, just as I thought: Google found nothing. But what is my point? My point is two-fold. First, I happen to know that Nehalem EP with QPI and Opteron with AMD HyperTransport are such good technologies that you really don’t have to care that much about NUMA software optimizations, at least at this point of the game. Reading M.O.S. note 1053332.1 (regarding disabling Linux NUMA support for Oracle Database Machine hosts) sort of drives that point home. However, saying you don’t need to care about NUMA doesn’t mean you shouldn’t experiment. How can anyone say that setting _enable_NUMA_support is a total placebo in all cases? One can’t prove a negative.

If you dare, trust me when I say that an understanding of NUMA will be as essential in the next 10 years as understanding SMP (parallelism and concurrency) was in the last 20 years. OK, off my soapbox.

Some Lessons in Enabling Oracle NUMA Optimizations with Oracle Database 11g Release 2
This section of the blog aims to point out that even when you think you might have tested Oracle NUMA optimizations, there is a chance you didn’t. You have to know how to verify that NUMA optimizations are actually in play. Why? Well, if the configuration is not right for enabling NUMA features, Oracle Database will simply ignore you. Consider the following session, where I demonstrate the following:

  1. Evidence that I am on a NUMA system (numactl(8))
  2. I started up an instance with a pfile (p4.ora) that has _enable_NUMA_support set to TRUE
  3. The instance started but _enable_NUMA_support was forced back to FALSE

Note, in spite of item 3, the alert log will not report anything to you about what went wrong.

SQL>
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 36317 MB
node 0 free: 31761 MB
node 1 size: 36360 MB
node 1 free: 35425 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

SQL> startup pfile=./p4.ora
ORACLE instance started.

Total System Global Area 5746786304 bytes
Fixed Size                  2213216 bytes
Variable Size            1207962272 bytes
Database Buffers         4294967296 bytes
Redo Buffers              241643520 bytes
Database mounted.
Database opened.
SQL> show parameter _enable_NUMA_support

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_enable_NUMA_support                 boolean     FALSE

SQL>
SQL> !grep _enable_NUMA_support ./p4.ora
_enable_NUMA_support=TRUE

OK, so the instance is up and the parameter was reverted. What does the IPC shared memory look like?

SQL> !ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0xed304ac0 229380     oracle    660        4096       0
0x7393f7f4 1179653    oracle    660        5773459456 35
0x00000000 393223     oracle    644        790528     5          dest
0x00000000 425992     oracle    644        790528     5          dest
0x00000000 458761     oracle    644        790528     5          dest

Right, so I have no NUMA placement of the buffer pool. On Linux, Oracle must create multiple segments and allocate them on specific NUMA nodes (memory hierarchies). It was a little simpler for the first NUMA-aware port of Oracle (Sequent) since the APIs allowed for the creation of a single shared memory segment with regions of the segment placed onto different memories. Ho Hum.

What Went Wrong
Oracle could not find the libnuma.so it wanted to load with dlopen():

$ grep libnuma /tmp/strace.out | grep ENOENT | head
14626 open("/usr/lib64/libnuma.so", O_RDONLY) = -1 ENOENT (No such file or directory)
14627 open("/usr/lib64/libnuma.so", O_RDONLY) = -1 ENOENT (No such file or directory)

So I create the necessary symbolic link and subsequently boot the instance and inspect the shared memory segments. Here I see that I have a ~1GB segment for the variable SGA components and my buffer pool has been segmented into two roughly 2.3 GB segments.
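
For reference, the fix itself is a one-liner run as root. On this system libnuma.so.1 was already present (it ships with the numactl package); the unversioned name is normally supplied by the development package, so check your distribution before assuming this is the right remedy:

# ln -s /usr/lib64/libnuma.so.1 /usr/lib64/libnuma.so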

# ls -l /usr/*64*/*numa*
lrwxrwxrwx 1 root root    23 Mar 17 09:25 /usr/lib64/libnuma.so -> /usr/lib64/libnuma.so.1
-rwxr-xr-x 1 root root 21752 Jul  7  2009 /usr/lib64/libnuma.so.1

SQL> show parameter db_cache_size

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_cache_size                        big integer 4G
SQL> show parameter NUMA_support

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_enable_NUMA_support                 boolean     TRUE
SQL> !ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0xed304ac0 229380     oracle    660        4096       0
0x00000000 2719749    oracle    660        1006632960 35
0x00000000 2752518    oracle    660        2483027968 35
0x00000000 393223     oracle    644        790528     6          dest
0x00000000 425992     oracle    644        790528     6          dest
0x00000000 458761     oracle    644        790528     6          dest
0x00000000 2785290    oracle    660        2281701376 35
0x7393f7f4 2818059    oracle    660        2097152    35

So there I have an SGA successfully created with _enable_NUMA_support set to TRUE. But, what strings appear in the alert log? Well, I’ll blog that soon because it leads me to other content.
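
Until then, a hedged way to see where those segment pages actually landed is /proc/<pid>/numa_maps, which reports per-node page counts (N0=..., N1=...) for each mapping:

# The ora_pmon pattern is illustrative; adjust for your instance name.
# SysV shared memory mappings show up with SYSV in the file= field.
$ grep SYSV /proc/$(pgrep -f ora_pmon)/numa_maps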




