Archive for the 'oracle hugepages' Category

Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn’t It? Part – I.

Allocating hugepages for Oracle Database on Linux can be tricky. The following is a short list of some of the common problems associated with faulty attempts to get things properly configured:

  1. Insufficient Hugepages.You can be short just a single 2MB hugepage at instance startup and Oracle will silently fall back to no hugepages. For instance, if an instance needs 10,000 hugepages but there are only 9,999 available at startup Oracle will create non-hugepages IPC shared memory and the 9,999 (x 2MB) is just wasted memory.
    1. Insufficient hugepages is an even more difficult situation when booting with _enable_NUMA_support=TRUE as partial hugepages backing is possible.
  2. Improper Permissions. Both limits.conf(5) memlock and the shell ulimit –l must accommodate the desired amount of locked memory.

In general, list item 1 above has historically been the most difficult to deal with—especially on systems hosting several instances of Oracle. Since there is no way to determine whether an existing segment of shared memory is backed with hugepages, diagnostics are in short supply. Oracle Database 11g Release 2 (11.2.0.2) The fix for Oracle bugs 9195408 (unpublished) and 9931916 (published) is available in 11.2.0.2. In a sort of fast forward to the past, the Linux port now supports an initialization parameter to force the instance to use hugepages for all segments or fail to boot. I recall initialization parameters on Unix ports back in the early 1990s that did just that. The initialization parameter is called use_large_pages and setting it to “only” results in the all or none scenario. This, by the way, addresses list item 1.1 above. That is, setting use_large_pages=only ensures an instance will not have some NUMA segments backed with hugepages and others without. Consider the following example. Here we see that use_large_pages is set to “only” and yet the system has only a very small number of hugepages allocated (800 == ~1.6GB). First I’ll boot the instance using an init.ora file that does not force hugepages and then move on to using the one that does. Note, this is 11.2.0.2.

$ sqlplus '/ as sysdba'

SQL*Plus: Release 11.2.0.2.0 Production on Tue Sep 28 08:10:36 2010

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to an idle instance.

SQL>
SQL> !grep -i huge /proc/meminfo
HugePages_Total:   800
HugePages_Free:    800
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
SQL>
SQL> !grep large_pages y.ora x.ora
use_large_pages=only
SQL>
SQL> startup force pfile=./x.ora
ORACLE instance started.

Total System Global Area 4.4363E+10 bytes
Fixed Size                  2242440 bytes
Variable`Size            1406199928 bytes
Database Buffers         4.2950E+10 bytes
Redo Buffers                4427776 bytes
Database mounted.
Database opened.
SQL> HOST date
Tue Sep 28 08:13:23 PDT 2010

SQL>  startup force pfile=./y.ora
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory

The user feedback is a trite ORA-27102. So the question is,  which memory cannot be allocated? Let’s take a look at the alert log:

Tue Sep 28 08:16:05 2010
Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 800 free: 800)
DFLT Huge Pages allocation successful (allocated: 512)
Huge Pages allocation failed (free: 288 required: 10432)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (1) (allocated: 3)
Huge Pages allocation failed (free: 285 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 285 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (1) (allocated: 192)
NUMA Huge Pages allocation on node (1) (allocated: 64)

That is good diagnostic information. It informs us that the variable portion of the SGA was successfully allocated and backed with hugepages. It just so happens that my variable SGA component is precisely sized to 1GB. That much is simple to understand. After creating the segment for the variable SGA component Oracle moves on to create the NUMA buffer pool segments. This is a 2-socket Nehalem EP system and Oracle allocates from the Nth NUMA node and works back to node 0. In this case the first buffer pool creation attempt is for node 1 (socket 1). However, there were insufficient hugepages as indicated in the alert log. In the following example I allocated  another arbitrarily insufficient number of hugepages and tried to start an instance with use_large_pages=only. This particular insufficient hugepages scenario allows us to see more interesting diagnostics:

SQL>  !grep -i huge /proc/meminfo
HugePages_Total: 12000
HugePages_Free:  12000
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

SQL> startup force pfile=./y.ora
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory

…and, the alert log:

Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 12000 free: 12000)
DFLT Huge Pages allocation successful (allocated: 512)
NUMA Huge Pages allocation on node (1) (allocated: 10432)
Huge Pages allocation failed (free: 1056 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 1056 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 1056 required: 5184)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (0) (allocated: 704)
NUMA Huge Pages allocation on node (0) (allocated: 320)

In this example we see 12,000 hugepages was sufficient to back the variable SGA component and only 1 of the NUMA buffer pools (remember this is Nehalem EP with OS boot string numa=on).

Summary

In my opinion, this is a must-set parameter if you need hugepages. With initialization parameters like use_large_pages, configuring hugepages for Oracle Database is getting a lot simpler.

Next In Series

  1. “[…] if you need hugepages”
  2. More on hugepages and NUMA
  3. Any pitfalls I find.

More Hugepages Articles

Link to Part II in this series: Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn’t It? Part – II. Link to Part III in this series: Configuring Linux Hugepages for Oracle Database is Just Too Difficult! Isn’t It? Part – III. And more: Quantifying hugepages Memory Savings with Oracle Database 11g Little Things Doth Crabby Make – Part X. Posts About Linux Hugepages Makes Some Crabby It Seems. Also, Words About Sizing Hugepages. Little Things Doth Crabby Make – Part IX. Sometimes You Have To Really, Really Want Your Hugepages Support For Oracle Database 11g. Little Things Doth Crabby Make – Part VIII. Hugepage Support for Oracle Database 11g Sometimes Means Using The ipcrm Command. Ugh. Oracle Database 11g Automatic Memory Management – Part I.

Oracle11g Automatic Memory Management – Part I. Linux Hugepages Support.

I spent the majority of my time in the Oracle Database 11g Beta program testing storage-related aspects of the new release. To be honest, I didn’t even take a short peek at the new Automatic Memory Management feature. As I pointed out the other day, Tanel Poder has started blogging about the feature.

If you read Tanel’s post you’ll see that he points out AMM-style shared memory does not use hugepages. This is because AMM memory segments are memory mapped files in /dev/shm. At this time, the major Linux distributions do not implement backing memory mapped files with hugepages as they do with System V-style IPC shared memory. The latter supports the SHM_HUGETLB flag passed to the shmget(P) call. It appears as though there was an effort to get hugepages support for memory mapped pages by adding MAP_HUGETLB flag support for the mmap(P) call as suggested in this kernel developer email thread from 2004. I haven’t been able to find just how far that proposed patch went however. Nonetheless, I’m sure Wim’s group is more than aware of that proposed mmap(P) support and if it is really important for Oracle Database 11g Automatic Memory Management, it seems likely there would be a 2.6 Kernel patch for it someday. But that begs the question: just how important are hugepages? Is it blasphemy to even ask the question?

Memory Mapped Files and Oracle Ports
The concept of large page tables is a bit of a porting nightmare. It will be interesting to see how the other ports deal with OS-level support for the dynamic nature of Automatic Memory Management. Will the other ports also use memory mapped files instead of IPC Shared Memory? If so, they too will have spotty large page table support for memory mapped files. For instance, Solaris 9 supported large page tables for mmap(2) pages, but only if it was an anonymous mmap (e.g., a map without a file) or a map of /dev/zero-neither of which would work for AMM. I understand that Solaris 10 supports large page tables for mmap(2) regions that are MAP_SHARED mmap(2)s of files-which is most likely how AMM will look on Solaris, but I’m only guessing. Other OSes, like Tru64-and I’m quite sure most others-don’t support large page tables for mmap(2)ed files. This will be interesting to watch.

Performance, Large Page Table, Etc
I remember back in the mid-90s when Sequent implemented shared large page tables for IPC Shared memory on our Unix variant-DYNIX/ptx. It was a very significant performance enhancement. For instance, 1024 shadow processes attached to a 1GB SGA required 1GB of physical memory-for the page tables alone! That was significant on systems that had very small L2 caches and only supported 4GB physical memory. Fast forwarding to today. I know people with Oracle 10g workloads that absolutely seize up their Linux (2.6. Kernel) system unless they use hugepages. Now I should point out that these sites I know of have a significant mix of structured and unstructured data. That is, they call out to LOBs in the filesystem (give me SecureFiles please). So the pathology they generally suffered without hugepages was memory thrashing between Oracle and the OS page cache (filesystem buffer cache). The salve for those wounds was hugepages since that essentially carves out and locks down the memory at boot time. Hugepages memory can never be nibbled up for page cache. To that end, benefiting from hugepages in this way is actually a by-product. The true point behind hugepages not the fact that it is reserved at boot time, but the fact that CPUs don’t have to thrash to maintain the physical to virtual translations (tlb). In general, hugepages are a lot more polite on processor caches and they reduce RAM overhead for page tables. Compared to the mid 1990s, however, RAM is about the least of our worries these days. Manageability is the most important and AMM aims to help on that front.

Confusion
Of all things Oracle and Linux, I think one of the topics that gets mangled the most is hugepages. The terms and nobs to twist run the gamut. There’s hugepages, hugetlb, hugetlbfs, hugetlbpool and so on. Then there are the differences from one Linux distribution and Linux kernel to the other. For instance, you can’t use hugepages on SuSE unless you turn off vm.disable_cap_mlock (need a few double negatives?). Then there is the question of boot-time versus /proc or sysctl(8) to reserve the pages. Finally, there is the fact that if you don’t have enough hugepages when you boot Oracle, Oracle will not complain-you just don’t get hugepages. I think Metalink 361323.1 does a decent job explaining hugepages with old and recent Linux in mind, but I never see it explained as succinctly as follows:

  1. Use OEL 4 or RHEL 4 with Oracle Database 10g or 11g
  2. Set oracle hard memlock N in /etc/security/limits.conf where N is a value large enough to cover your SGA needs
  3. Set vm.nr_hugepages in /etc/sysctl.conf to a value large enough to cover your SGA.

Further Confusion
Audited TPC results don’t help. For instance, on page 125 of this Full disclosure report from a recent Oracle10g TPC-C, there are listings of sysctl.conf and lilo showing the setting of the hugetlbpool parameter. That would be just fine if this was a RHEL3 benchmark since vm.hugetlbpool doesn’t exist in RHEL4.

Performance
I admit I haven’t done a great deal of testing with AMM, but generally a quick I/O-intensive OLTP test on a system with 4 processor cores utilized at 100% speak volumes to me. So I did just such a test.

Using an order-entry workload accessing the schema detailed in this Oracle Whitepaper about Direct NFS, I tested two configurations:

Automatic Memory Management (AMM). Just like it says, I configured the simplest set of initialization parameters I could:

UNDO_TABLESPACE=rb1
UNDO_MANAGEMENT = AUTO
compatible = 10.1.0.0
control_files                  = ( /u01/app/oracle/product/11/db_1/rw/DATA/cntlbench_1 )
db_block_size                   = 4096
MEMORY_TARGET=1500M
db_files                        = 100
db_writer_processes = 1
db_name                         = bench
processes                       = 200
sessions                        = 400
cursor_space_for_time           = TRUE  # pin the sql in cache
filesystemio_options=setall

Manual Memory Management(MMM). I did my best to tailor the important SGA regions to match what AMM produced. In my mind, for an OLTP workload the most important SGA regions are the block buffers and the shared pool.

UNDO_TABLESPACE=rb1
UNDO_MANAGEMENT = AUTO
compatible = 10.1.0.0
control_files                  = ( /u01/app/oracle/product/11/db_1/rw/DATA/cntlbench_1 )
db_block_size                   = 4096
#MEMORY_TARGET=1500M
db_cache_size = 624M
shared_pool_size=224M
db_files                        = 100
db_writer_processes = 1
db_name                         = bench
processes                       = 200
sessions                        = 400
cursor_space_for_time           = TRUE  # pin the sql in cache
filesystemio_options=setall

The following v$sgainfo output justifies just how closely configured the AMM and MMM cases were.

AMM:

SQL> select * from v$sgainfo ;

NAME                                  BYTES RES
-------------------------------- ---------- ---
Fixed SGA Size                      1298916 No
Redo Buffers                       11943936 No
Buffer Cache Size                 654311424 Yes
Shared Pool Size                  234881024 Yes
Large Pool Size                    16777216 Yes
Java Pool Size                     16777216 Yes
Streams Pool Size                         0 Yes
Shared IO Pool Size                33554432 Yes
Granule Size                       16777216 No
Maximum SGA Size                 1573527552 No
Startup overhead in Shared Pool    83886080 No

NAME                                  BYTES RES
-------------------------------- ---------- ---
Free SGA Memory Available                 0

MMM:

SQL> select * from v$sgainfo ;
NAME                                  BYTES RES
-------------------------------- ---------- ---
Fixed SGA Size                      1302592 No
Redo Buffers                        4964352 No
Buffer Cache Size                 654311424 Yes
Shared Pool Size                  234881024 Yes
Large Pool Size                           0 Yes
Java Pool Size                     25165824 Yes
Streams Pool Size                         0 Yes
Shared IO Pool Size                29360128 Yes
Granule Size                        4194304 No
Maximum SGA Size                  949989376 No
Startup overhead in Shared Pool    75497472 No

NAME                                  BYTES RES
-------------------------------- ---------- ---
Free SGA Memory Available                 0

The server was a HP DL380 with 4 processor cores and the storage was an HP EFS Clustered Gateway NAS. Before each test I did the following:

  1. Restore Database
  2. Reboot Server
  3. Mount NFS filesystems
  4. Boot Oracle

Before the MMM case I set vm.nr_hugepages=600 and after the database was booted, hugepages utilization looked like this:

$ grep Huge /proc/meminfo
HugePages_Total:   600
HugePages_Free:    145
Hugepagesize:     2048 kB

So, given all these conditions, I believe I am making an apples-apples comparison of AMM to MMM where AMM does not get hugepages support but MMM does. I think this is a pretty stressful workload since I am maxing out the processors and performing a significant amount of I/O-given the size of the server.

Test Results
OK, so this is a very contained case and Oracle Database 11g is still only available on x86 Linux. I hope I can have the time to do a similar test with more substantial gear. For the time being, what I know is that losing hugepages support for the sake of gaining AMM should not make you lose sleep. The results measured in throughput (transactions per second) and server statistics are in:

Configuration OLTP Transactions/sec Logical IO/sec Block Changes/sec Physical Read/sec Physical Write/sec
AMM 905 36,742 10,195 4,287 2,817
MMM 872 36,411 10,101 4,864 2,928

Looks like 4% in the favor of AMM to me and that is likely attributed to the 13% more physical I/O per transaction the MMM case had to perform. That part of the results has me baffled for the moment since they both have the same buffering as the v$sgainfo output above shows. Well, yes, there is a significant difference in the amount of Large Pool in the MMM case, but this workload really shouldn’t have any demand on Large Pool. I’m going to investigate that further. Perhaps an interesting test would be to reduce the amount buffering the AMM case gets to force more physical I/O. That could bring it more in line. We’ll see.

Summary
I’m not saying hugepages is no help across the board. What I am saying is that I would weigh heavily the benefits AMM offers because losing hugepages might not make any difference for you at all. If it is, in fact, a huge problem across the board then it looks like there has been work done in this area for the 2.6 Kernel and it seems reasonable that such a feature (hugepages support for mmap(P)) could be implemented. We’ll see.


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 744 other subscribers
Oracle ACE Program Status

Click It

website metrics

Fond Memories

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.

%d bloggers like this: