Archive for the 'Nehalem EP' Category

Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn’t It? Part – I.

Allocating hugepages for Oracle Database on Linux can be tricky. The following is a short list of some of the common problems associated with faulty attempts to get things properly configured:

  1. Insufficient Hugepages.You can be short just a single 2MB hugepage at instance startup and Oracle will silently fall back to no hugepages. For instance, if an instance needs 10,000 hugepages but there are only 9,999 available at startup Oracle will create non-hugepages IPC shared memory and the 9,999 (x 2MB) is just wasted memory.
    1. Insufficient hugepages is an even more difficult situation when booting with _enable_NUMA_support=TRUE as partial hugepages backing is possible.
  2. Improper Permissions. Both limits.conf(5) memlock and the shell ulimit –l must accommodate the desired amount of locked memory.

In general, list item 1 above has historically been the most difficult to deal with—especially on systems hosting several instances of Oracle. Since there is no way to determine whether an existing segment of shared memory is backed with hugepages, diagnostics are in short supply. Oracle Database 11g Release 2 (11.2.0.2) The fix for Oracle bugs 9195408 (unpublished) and 9931916 (published) is available in 11.2.0.2. In a sort of fast forward to the past, the Linux port now supports an initialization parameter to force the instance to use hugepages for all segments or fail to boot. I recall initialization parameters on Unix ports back in the early 1990s that did just that. The initialization parameter is called use_large_pages and setting it to “only” results in the all or none scenario. This, by the way, addresses list item 1.1 above. That is, setting use_large_pages=only ensures an instance will not have some NUMA segments backed with hugepages and others without. Consider the following example. Here we see that use_large_pages is set to “only” and yet the system has only a very small number of hugepages allocated (800 == ~1.6GB). First I’ll boot the instance using an init.ora file that does not force hugepages and then move on to using the one that does. Note, this is 11.2.0.2.

$ sqlplus '/ as sysdba'

SQL*Plus: Release 11.2.0.2.0 Production on Tue Sep 28 08:10:36 2010

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to an idle instance.

SQL>
SQL> !grep -i huge /proc/meminfo
HugePages_Total:   800
HugePages_Free:    800
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
SQL>
SQL> !grep large_pages y.ora x.ora
use_large_pages=only
SQL>
SQL> startup force pfile=./x.ora
ORACLE instance started.

Total System Global Area 4.4363E+10 bytes
Fixed Size                  2242440 bytes
Variable`Size            1406199928 bytes
Database Buffers         4.2950E+10 bytes
Redo Buffers                4427776 bytes
Database mounted.
Database opened.
SQL> HOST date
Tue Sep 28 08:13:23 PDT 2010

SQL>  startup force pfile=./y.ora
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory

The user feedback is a trite ORA-27102. So the question is,  which memory cannot be allocated? Let’s take a look at the alert log:

Tue Sep 28 08:16:05 2010
Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 800 free: 800)
DFLT Huge Pages allocation successful (allocated: 512)
Huge Pages allocation failed (free: 288 required: 10432)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (1) (allocated: 3)
Huge Pages allocation failed (free: 285 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 285 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (1) (allocated: 192)
NUMA Huge Pages allocation on node (1) (allocated: 64)

That is good diagnostic information. It informs us that the variable portion of the SGA was successfully allocated and backed with hugepages. It just so happens that my variable SGA component is precisely sized to 1GB. That much is simple to understand. After creating the segment for the variable SGA component Oracle moves on to create the NUMA buffer pool segments. This is a 2-socket Nehalem EP system and Oracle allocates from the Nth NUMA node and works back to node 0. In this case the first buffer pool creation attempt is for node 1 (socket 1). However, there were insufficient hugepages as indicated in the alert log. In the following example I allocated  another arbitrarily insufficient number of hugepages and tried to start an instance with use_large_pages=only. This particular insufficient hugepages scenario allows us to see more interesting diagnostics:

SQL>  !grep -i huge /proc/meminfo
HugePages_Total: 12000
HugePages_Free:  12000
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

SQL> startup force pfile=./y.ora
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory

…and, the alert log:

Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 12000 free: 12000)
DFLT Huge Pages allocation successful (allocated: 512)
NUMA Huge Pages allocation on node (1) (allocated: 10432)
Huge Pages allocation failed (free: 1056 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 1056 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 1056 required: 5184)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (0) (allocated: 704)
NUMA Huge Pages allocation on node (0) (allocated: 320)

In this example we see 12,000 hugepages was sufficient to back the variable SGA component and only 1 of the NUMA buffer pools (remember this is Nehalem EP with OS boot string numa=on).

Summary

In my opinion, this is a must-set parameter if you need hugepages. With initialization parameters like use_large_pages, configuring hugepages for Oracle Database is getting a lot simpler.

Next In Series

  1. “[…] if you need hugepages”
  2. More on hugepages and NUMA
  3. Any pitfalls I find.

More Hugepages Articles

Link to Part II in this series: Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn’t It? Part – II. Link to Part III in this series: Configuring Linux Hugepages for Oracle Database is Just Too Difficult! Isn’t It? Part – III. And more: Quantifying hugepages Memory Savings with Oracle Database 11g Little Things Doth Crabby Make – Part X. Posts About Linux Hugepages Makes Some Crabby It Seems. Also, Words About Sizing Hugepages. Little Things Doth Crabby Make – Part IX. Sometimes You Have To Really, Really Want Your Hugepages Support For Oracle Database 11g. Little Things Doth Crabby Make – Part VIII. Hugepage Support for Oracle Database 11g Sometimes Means Using The ipcrm Command. Ugh. Oracle Database 11g Automatic Memory Management – Part I.

Oracle Exadata Database Machine I/O Bottleneck Revealed At… 157 MB/s! But At Least It Scales Linearly Within Datasheet-Specified Bounds!

It has been quite a while since my last Exadata-related post. Since I spend all my time, every working day, on Exadata performance work this blogging dry-spell should seem quite strange to readers of this blog. However, for a while it seemed to me as though I was saturating the websphere on the topic and Exadata is certainly more than a sort of  Kevin’s Dog and Pony Show. It was time to let other content filter up on the Google search results. Now, having said that, there have been times I’ve wished I had continued to saturate the namespace on the topic because of some of the totally erroneous content I’ve seen on the Web.

Most of the erroneous content is low-balling Exadata with FUD, but a surprisingly sad amount of content that over-hypes Exadata exists as well. Both types of erroneous content are disheartening to me given my profession. In actuality, the hype content is more disheartening to me than the FUD. I understand the motivation behind FUD, however, I cannot understand the need to make a good thing out to be better than it is with hype. Exadata is, after all, a machine with limits folks. All machines have limits. That’s why Exadata comes in different size configurations  for heaven’s sake! OK, enough of that.

FUD or Hype? Neither, Thank You Very Much!
Both the FUD-slinging folks and the folks spewing the ueber-light-speed, anti-matter-powered warp-drive throughput claims have something in common—they don’t understand the technology.  That is quickly changing though. Web content is popping up from sources I know and trust. Sources outside the walls of Oracle as well. In fact, two newly accepted co-members of the OakTable Network have started blogging about their Exadata systems. Kerry Osborne and Frits Hoogland have been posting about Exadata lately (e.g., Kerry Osborne on Exadata Storage Indexes).

I’d like to draw attention to Frits Hoogland’s investigation into Exadata. Frits is embarking on a series that starts with baseline table scan performance on a half-rack Exadata configuration that employs none of the performance features of Exadata (e.g., storage offload processing disabled). His approach is to then enable Exadata features and show the benefit while giving credit to which specific aspect of Exadata is responsible for the improved throughput. The baseline test in Frits’ series is achieved by disabling both Exadata cell offload processing and Parallel Query Option! To that end, the scan is being driven by a single foreground process executing on one of the 32 Intel Xeon 5500 (Nehalem EP) cores in his half-rack Database Machine.

Frits cited throughput numbers but left out what I believe is a critical detail about the baseline result—where was the bottleneck?

In Frits’ test, a single foreground process drives the non-offloaded scan at roughly 157MB/s. Why not 1,570MB/s (I’ve heard everything Exadata is supposed to be 10x)? A quick read of any Exadata datasheet will suggest that a half-rack Version 2 Exadata configuration offers up to 25GB/s scan throughput (when scanning both HDD and FLASH storage assets concurrently). So, why not 25 GB/s? The answer is that the flow of data has to go somewhere.

In Frits’ particular baseline case the data is flowing from cells via iDB (RDS IB) into heap-buffered PGA in a single foreground executing on a single core on a single Nehalem EP processor. Along with that data flow is the CPU cost paid by the foreground process in its marshalling all the I/O (communicating with Exadata via the intelligent storage layer) which means interacting with cells to request the ASM extents as per its mapping of the table segments to ASM extents (in the ASM extent map). Also, the particular query being tested by Frits performs a count(*) and predicates on a column. To that end, a single core in that single Nehalem EP socket is touching every row in every block for predicate evaluation. With all that going on, one should not expect more than 157MB/s to flow through a single Xeon 5500 core. That is a lot of code execution.

What Is My Point?
The point is that all systems have bottlenecks somewhere. In this case, Frits is creating a synthetic CPU bottleneck as a baseline in a series of tests. The only reason I’m blogging the point is that Frits didn’t identify the bottleneck in that particular test. I’d hate to see the FUD-slingers suggest that a half-rack Version 2 Exadata configuration bottlenecks at 157 MB/s for disk throughput related reasons about as badly as I’d hate to see the hype-spewing-light-speed-anti-matter-warp rah-rah folks suggest that this test could scale up without bounds. I mean to say that I would hate to see someone blindly project how Frits’ baseline test would scale with concurrent invocations. After all, there are 8 cores, 16 threads on each host in the Version 2 Database Machine and therefore 32/64 in a half rack (there are 4 hosts). Surely Frits could invoke 32 or 64 sessions each performing this query without exhibiting any bottlenecks, right? Indeed, 157 MB/s by 64 sessions is about 10 GB/s which fits within the datasheet claims. And, indeed, since the memory bandwidth in this configuration is about 19 GB/s into each Nehalem EP socket there must surely be no reason this query wouldn’t scale linearly, right? The answer is I don’t have the answer. I haven’t tested it. What I would not advise, however, is dividing maximum theoretical arbitrary bandwidth figures (e.g., the 25GB/s scan bandwidth offered by a half-rack) by a measured application throughput requirement  (e.g., Frits’ 157 MB/s) and claim victory just because the math happens to work out in your favor. That would be junk science.

Frits is not blogging junk science. I recommend following this fellow OakTable member to see where it goes.

You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part I.

In May 2009 I made a blog entry entitled You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part II. There had not yet been a Part I but as I pointed out in that post I would loop back and make Part I. Here it is. Better late than never.

Background
I originally planned to use Part I to stroll down memory lane (back to 1995) with a story about the then VP of Oracle RDBMS Development’s initial impression about the Sequent DYNIX/ptx NUMA API during a session where we presented it and how it would be beneficial to code to NUMA APIs sooner rather than later. We were mixing vision with the specific need of our port to be honest.

We were the first to have a production NUMA API to which Oracle could port and we were quite a bit sooner to the whole NUMA trend than anyone else. Our’s was the first production NUMA system.

Now, this VP is no longer at Oracle but the  (redacted) response was, “Why would we want to use any of this ^#$%.”  We (me and the three others presenting the API) were caught off guard. However, we all knew that the question was a really good question. There were still good companies making really tight, high-end SMPs with uniform memory.  Just because we (Sequent) had to move into NUMA architecture didn’t mean we were blind to the reality around us. However, one thing we knew for sure—all systems in the future would have NUMA attributes of varying levels. All our competition was either in varying stages of denial or doing what I like to refer to as “Poo-pooh it while you do it.” All the major players eventually came out with NUMA systems.  Some sooner, some later and the others died trying.

That takes us to Commodity NUMA and the new purpose of this “Part I” post.

Before I say a word about this Part I I’d like to point out that the concepts in Part II are of a “must-know” variety unless you relinquish your computing power to some sort of hosted facility where you don’t have the luxury of caring about the architecture upon which you run Oracle Database.

Part II was about the different types of NUMA (historical and present) and such knowledge will help you if you find yourself in a troubling performance situation that relates to NUMA. NUMA is commodity, as I point out, and we have to come to grips with that.

What Is He Blogging About?
The current state of commodity NUMA is very peculiar. These Commodity NUMA Implementations (CNI) systems are so tightly coupled that most folks don’t even realize they are running on a NUMA system. In fact, let me go out on a ledge. I assert that nobody is configuring Oracle Database 11g Release 2 with NUMA optimizations in spite of the fact that they are on a NUMA box (e.g., Nehalem EP, AMD Opterton). The reason I believe this is because the init.ora parameter to invoke Oracle NUMA awareness changed names from 11gR1 to 11gR2 as per My Oracle Support note 864633.1. The parameter changed from _enable_NUMA_optimization to enable_NUMA_support. I know nobody is setting this because if they had I can almost guarantee they would have googled for problems. Allow me to explain.

If Nobody is Googling It, Nobody is Doing It
Anyone who tests _enable_NUMA_support as per My Oracle Support note 864633.1 will likely experience the sorts of problems that I detail later in this post. But first, let’s see what they would get from google when they search for _enable_NUMA_support:

Yes, just as I thought…Google found nothing. But what is my point? My point is two-fold. First, I happen to know that Nehalem EP  with QPI and Opteron with AMD HyperTransport are such good technologies that you really don’t have to care that much about NUMA software optimizations. At least to this point of the game. Reading M.O.S note 1053332.1 (regards disabling Linux NUMA support for Oracle Database Machine hosts) sort of drives that point home. However, saying you don’t need to care about NUMA doesn’t mean you shouldn’t experiment. How can anyone say that setting _enable_NUMA_support is a total placebo in all cases? One can’t prove a negative.

If you dare, trust me when I say that an understanding of NUMA will be as essential in the next 10 years as understanding SMP (parallelism and concurrency) was in the last 20 years. OK, off my soapbox.

Some Lessons in Enabling Oracle NUMA Optimizations with Oracle Database 11g Release 2
This section of the blog aims to point out that even when you think you might have tested Oracle NUMA optimizations there is a chance you didn’t. You have to know the way to ensure you have NUMA optimizations in play. Why? Well, if the configuration is not right for enabling NUMA features, Oracle Database will simply ignore you. Consider the following session where I demonstrate the following:

  1. Evidence that I am on a NUMA system (numactl(8))
  2. I started up an instance with a pfile (p4.ora) that has _enable_NUMA_support set to TRUE
  3. The instance started but _enable_NUMA_support was forced back to FALSE

Note, in spite of event #3, the alert log will not report anything to you about what went wrong.

SQL>
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 36317 MB
node 0 free: 31761 MB
node 1 size: 36360 MB
node 1 free: 35425 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

SQL> startup pfile=./p4.ora
ORACLE instance started.

Total System Global Area 5746786304 bytes
Fixed Size                  2213216 bytes
Variable Size            1207962272 bytes
Database Buffers         4294967296 bytes
Redo Buffers              241643520 bytes
Database mounted.
Database opened.
SQL> show parameter _enable_NUMA_support

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_enable_NUMA_support                 boolean     FALSE

SQL>
SQL> !grep _enable_NUMA_support ./p4.ora
_enable_NUMA_support=TRUE

OK, so the instance is up and the parameter was reverted, what does the IPC shared memory segment look like?

SQL> !ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0xed304ac0 229380     oracle    660        4096       0
0x7393f7f4 1179653    oracle    660        5773459456 35
0x00000000 393223     oracle    644        790528     5          dest
0x00000000 425992     oracle    644        790528     5          dest
0x00000000 458761     oracle    644        790528     5          dest

Right, so I have no NUMA placement of the buffer pool. On Linux, Oracle must create multiple segments and allocate them on specific NUMA nodes (memory hierarchies). It was a little simpler for the first NUMA-aware port of Oracle (Sequent) since the APIs allowed for the creation of a single shared memory segment with regions of the segment placed onto different memories. Ho Hum.

What Went Wrong
Oracle could not find the libnuma.so it wanted to link with dlopen():

$ grep libnuma /tmp/strace.out | grep ENOENT | head
14626 open("/usr/lib64/libnuma.so", O_RDONLY) = -1 ENOENT (No such file or directory)
14627 open("/usr/lib64/libnuma.so", O_RDONLY) = -1 ENOENT (No such file or directory)

So I create the necessary symbolic link and subsequently boot the instance and inspect the shared memory segments. Here I see that I have a ~1GB segment for the variable SGA components and my buffer pool has been segmented into two roughly 2.3 GB segments.

# ls -l /usr/*64*/*numa*
lrwxrwxrwx 1 root root    23 Mar 17 09:25 /usr/lib64/libnuma.so -> /usr/lib64/libnuma.so.1
-rwxr-xr-x 1 root root 21752 Jul  7  2009 /usr/lib64/libnuma.so.1

SQL> show parameter db_cache_size

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_cache_size                        big integer 4G
SQL> show parameter NUMA_support

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_enable_NUMA_support                 boolean     TRUE
SQL> !ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0xed304ac0 229380     oracle    660        4096       0
0x00000000 2719749    oracle    660        1006632960 35
0x00000000 2752518    oracle    660        2483027968 35
0x00000000 393223     oracle    644        790528     6          dest
0x00000000 425992     oracle    644        790528     6          dest
0x00000000 458761     oracle    644        790528     6          dest
0x00000000 2785290    oracle    660        2281701376 35
0x7393f7f4 2818059    oracle    660        2097152    35

So there I have an SGA successfully created with _enable_NUMA_support set to TRUE. But, what strings appear in the alert log? Well, I’ll blog that soon because it leads me to other content.


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 747 other subscribers
Oracle ACE Program Status

Click It

website metrics

Fond Memories

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.

%d bloggers like this: