I spent the majority of my time in the Oracle Database 11g Beta program testing storage-related aspects of the new release. To be honest, I didn’t even take a short peek at the new Automatic Memory Management feature. As I pointed out the other day, Tanel Poder has started blogging about the feature.
If you read Tanel’s post you’ll see that he points out AMM-style shared memory does not use hugepages. This is because AMM memory segments are memory mapped files in /dev/shm. At this time, the major Linux distributions do not back memory mapped files with hugepages as they do with System V-style IPC shared memory. The latter supports the SHM_HUGETLB flag passed to the shmget(2) call. It appears as though there was an effort to get hugepages support for memory mapped pages by adding MAP_HUGETLB flag support for the mmap(2) call, as suggested in this kernel developer email thread from 2004. I haven’t been able to find just how far that proposed patch went, however. Nonetheless, I’m sure Wim’s group is more than aware of that proposed mmap(2) support, and if it is really important for Oracle Database 11g Automatic Memory Management, it seems likely there would be a 2.6 Kernel patch for it someday. But that begs the question: just how important are hugepages? Is it blasphemy to even ask the question?
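To make the mechanism concrete, here is a minimal sketch of a file-backed shared mapping of the kind described above. It is only an illustration of mapping a file with MAP_SHARED, not Oracle’s implementation; the helper name map_shared_file and the demo_segment_ file prefix are hypothetical, and the fallback to the system temp directory is an assumption made so the sketch runs where /dev/shm is unavailable.

```python
import mmap
import os
import tempfile


def map_shared_file(size_bytes, directory=None):
    """Create a file, size it, and map it MAP_SHARED. Returns (mapping, path)."""
    if directory is None:
        # Prefer the tmpfs mount discussed in the text; otherwise fall back
        # to the ordinary temp directory (an assumption for portability).
        shm = "/dev/shm"
        usable = os.path.isdir(shm) and os.access(shm, os.W_OK)
        directory = shm if usable else tempfile.gettempdir()
    fd, path = tempfile.mkstemp(prefix="demo_segment_", dir=directory)
    try:
        os.ftruncate(fd, size_bytes)  # give the backing file its full size
        mapping = mmap.mmap(
            fd,
            size_bytes,
            flags=mmap.MAP_SHARED,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
        )
    finally:
        os.close(fd)  # the mapping remains valid after the descriptor closes
    return mapping, path


# Usage: map one 4 KiB region, write to it, then clean up.
segment, segment_path = map_shared_file(4096)
segment[:5] = b"hello"
print(segment[:5], "backed by", segment_path)
segment.close()
os.remove(segment_path)
```

Any other process that maps the same file with MAP_SHARED sees the same bytes, which is what makes this usable as shared memory. Nothing in the call asks for large pages, which is the gap the paragraph above describes.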
Memory Mapped Files and Oracle Ports
The concept of large page tables is a bit of a porting nightmare. It will be interesting to see how the other ports deal with OS-level support for the dynamic nature of Automatic Memory Management. Will the other ports also use memory mapped files instead of IPC Shared Memory? If so, they too will have spotty large page table support for memory mapped files. For instance, Solaris 9 supported large page tables for mmap(2) pages, but only if it was an anonymous mmap (e.g., a map without a file) or a map of /dev/zero, neither of which would work for AMM. I understand that Solaris 10 supports large page tables for mmap(2) regions that are MAP_SHARED mmap(2)s of files, which is most likely how AMM will look on Solaris, but I’m only guessing. Other OSes, like Tru64 (and I’m quite sure most others), don’t support large page tables for mmap(2)ed files. This will be interesting to watch.
Performance, Large Page Table, Etc
I remember back in the mid-90s when Sequent implemented shared large page tables for IPC Shared memory on our Unix variant, DYNIX/ptx. It was a very significant performance enhancement. For instance, 1024 shadow processes attached to a 1GB SGA required 1GB of physical memory for the page tables alone! That was significant on systems that had very small L2 caches and only supported 4GB physical memory. Fast forward to today. I know people with Oracle 10g workloads that absolutely seize up their Linux (2.6 Kernel) system unless they use hugepages. Now I should point out that these sites I know of have a significant mix of structured and unstructured data. That is, they call out to LOBs in the filesystem (give me SecureFiles please). So the pathology they generally suffered without hugepages was memory thrashing between Oracle and the OS page cache (filesystem buffer cache). The salve for those wounds was hugepages since that essentially carves out and locks down the memory at boot time. Hugepages memory can never be nibbled up for page cache. To that end, benefiting from hugepages in this way is actually a by-product. The true point behind hugepages is not the fact that the memory is reserved at boot time, but the fact that CPUs don’t have to thrash to maintain the virtual-to-physical translations (TLB). In general, hugepages are a lot more polite on processor caches and they reduce RAM overhead for page tables. Compared to the mid 1990s, however, RAM is about the least of our worries these days. Manageability is the most important and AMM aims to help on that front.
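The 1GB-of-page-tables figure is easy to reproduce with back-of-the-envelope arithmetic. The sketch below assumes 4 KiB base pages, 4-byte page table entries, and page tables that are not shared between processes; none of those values are stated above, they are simply the assumptions under which the numbers work out. The 2 MiB comparison uses the Linux Hugepagesize shown later in this post and keeps the entry size the same purely for comparison.

```python
def page_table_bytes(sga_bytes, page_bytes, pte_bytes, processes):
    """Memory consumed by per-process (unshared) page table entries for the SGA."""
    entries_per_process = sga_bytes // page_bytes
    return entries_per_process * pte_bytes * processes


GIB = 1024 ** 3
MIB = 1024 ** 2
KIB = 1024

# Assumed: 4 KiB pages, 4-byte entries, 1024 processes mapping a 1 GiB SGA.
small_pages = page_table_bytes(GIB, 4 * KIB, 4, 1024)
# Same assumptions, but with 2 MiB pages.
huge_pages = page_table_bytes(GIB, 2 * MIB, 4, 1024)

print(small_pages // MIB, "MiB of page table entries with 4 KiB pages")
print(huge_pages // KIB, "KiB of page table entries with 2 MiB pages")
```

Under these assumptions each process needs 262,144 entries (1 MiB) to map the SGA with small pages, so 1024 processes need a full 1 GiB, while the same mapping with 2 MiB pages needs 512 entries per process.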
Of all things Oracle and Linux, I think one of the topics that gets mangled the most is hugepages. The terms and knobs to twist run the gamut. There’s hugepages, hugetlb, hugetlbfs, hugetlbpool and so on. Then there are the differences from one Linux distro and Linux kernel to the other. For instance, you can’t use hugepages on SuSE unless you turn off vm.disable_cap_mlock (need a few double negatives?). Then there is the question of boot-time versus /proc or sysctl(8) to reserve the pages. Finally, there is the fact that if you don’t have enough hugepages when you boot Oracle, Oracle will not complain. You just don’t get hugepages. I think Metalink 361323.1 does a decent job explaining hugepages with old and recent Linux in mind, but I never see it explained as succinctly as follows:
- Use OEL 4 or RHEL 4 with Oracle Database 10g or 11g
- Set oracle hard memlock N in /etc/security/limits.conf where N is a value large enough to cover your SGA needs
- Set vm.nr_hugepages in /etc/sysctl.conf to a value large enough to cover your SGA.
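As a rough sketch of how those two values relate to SGA size, the snippet below derives a minimum page count and a matching memlock limit. The helper name hugepage_settings is hypothetical; the 2048 kB page size and the 949989376-byte SGA are the values that appear later in this post, and the memlock value is expressed in KB as limits.conf expects.

```python
def hugepage_settings(sga_bytes, hugepage_kb=2048):
    """Return (minimum hugepage count, memlock limit in KB) covering sga_bytes."""
    hugepage_bytes = hugepage_kb * 1024
    pages = -(-sga_bytes // hugepage_bytes)  # integer ceiling division
    memlock_kb = pages * hugepage_kb
    return pages, memlock_kb


# Maximum SGA Size reported by v$sgainfo for the manually tuned instance.
pages, memlock_kb = hugepage_settings(949989376)

print("# /etc/security/limits.conf")
print(f"oracle hard memlock {memlock_kb}")
print("# /etc/sysctl.conf")
print(f"vm.nr_hugepages = {pages}")
```

Treat the result as a lower bound rather than a recommendation. In the test described below the pool was set to 600 pages and 455 were actually in use, slightly more than this minimum, so some headroom is sensible.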
Audited TPC results don’t help. For instance, on page 125 of this Full disclosure report from a recent Oracle10g TPC-C, there are listings of sysctl.conf and lilo showing the setting of the hugetlbpool parameter. That would be just fine if this were a RHEL3 benchmark, but vm.hugetlbpool doesn’t exist in RHEL4.
I admit I haven’t done a great deal of testing with AMM, but generally a quick I/O-intensive OLTP test on a system with 4 processor cores utilized at 100% speaks volumes to me. So I did just such a test.
Using an order-entry workload accessing the schema detailed in this Oracle Whitepaper about Direct NFS, I tested two configurations:
Automatic Memory Management (AMM). Just like it says, I configured the simplest set of initialization parameters I could:
UNDO_TABLESPACE=rb1
UNDO_MANAGEMENT = AUTO
compatible = 10.1.0.0
control_files = ( /u01/app/oracle/product/11/db_1/rw/DATA/cntlbench_1 )
db_block_size = 4096
MEMORY_TARGET=1500M
db_files = 100
db_writer_processes = 1
db_name = bench
processes = 200
sessions = 400
cursor_space_for_time = TRUE # pin the sql in cache
filesystemio_options=setall
Manual Memory Management(MMM). I did my best to tailor the important SGA regions to match what AMM produced. In my mind, for an OLTP workload the most important SGA regions are the block buffers and the shared pool.
UNDO_TABLESPACE=rb1
UNDO_MANAGEMENT = AUTO
compatible = 10.1.0.0
control_files = ( /u01/app/oracle/product/11/db_1/rw/DATA/cntlbench_1 )
db_block_size = 4096
#MEMORY_TARGET=1500M
db_cache_size = 624M
shared_pool_size=224M
db_files = 100
db_writer_processes = 1
db_name = bench
processes = 200
sessions = 400
cursor_space_for_time = TRUE # pin the sql in cache
filesystemio_options=setall
The following v$sgainfo output shows just how closely configured the AMM and MMM cases were.
SQL> select * from v$sgainfo ;

NAME                                  BYTES RES
-------------------------------- ---------- ---
Fixed SGA Size                      1298916 No
Redo Buffers                       11943936 No
Buffer Cache Size                 654311424 Yes
Shared Pool Size                  234881024 Yes
Large Pool Size                    16777216 Yes
Java Pool Size                     16777216 Yes
Streams Pool Size                         0 Yes
Shared IO Pool Size                33554432 Yes
Granule Size                       16777216 No
Maximum SGA Size                 1573527552 No
Startup overhead in Shared Pool    83886080 No

NAME                                  BYTES RES
-------------------------------- ---------- ---
Free SGA Memory Available                 0
SQL> select * from v$sgainfo ;

NAME                                  BYTES RES
-------------------------------- ---------- ---
Fixed SGA Size                      1302592 No
Redo Buffers                        4964352 No
Buffer Cache Size                 654311424 Yes
Shared Pool Size                  234881024 Yes
Large Pool Size                           0 Yes
Java Pool Size                     25165824 Yes
Streams Pool Size                         0 Yes
Shared IO Pool Size                29360128 Yes
Granule Size                        4194304 No
Maximum SGA Size                  949989376 No
Startup overhead in Shared Pool    75497472 No

NAME                                  BYTES RES
-------------------------------- ---------- ---
Free SGA Memory Available                 0
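For a quick side-by-side of the two listings, the sketch below copies the values into two dictionaries and reports which components match and which differ. The listings are not labeled above; I am treating the first as the AMM case because its Maximum SGA Size is close to MEMORY_TARGET=1500M, which is an inference rather than something the output states. The dictionary and function names are hypothetical.

```python
first_listing = {  # presumed AMM case (see the caveat above)
    "Fixed SGA Size": 1298916,
    "Redo Buffers": 11943936,
    "Buffer Cache Size": 654311424,
    "Shared Pool Size": 234881024,
    "Large Pool Size": 16777216,
    "Java Pool Size": 16777216,
    "Streams Pool Size": 0,
    "Shared IO Pool Size": 33554432,
    "Granule Size": 16777216,
    "Maximum SGA Size": 1573527552,
    "Startup overhead in Shared Pool": 83886080,
    "Free SGA Memory Available": 0,
}

second_listing = {  # presumed MMM case
    "Fixed SGA Size": 1302592,
    "Redo Buffers": 4964352,
    "Buffer Cache Size": 654311424,
    "Shared Pool Size": 234881024,
    "Large Pool Size": 0,
    "Java Pool Size": 25165824,
    "Streams Pool Size": 0,
    "Shared IO Pool Size": 29360128,
    "Granule Size": 4194304,
    "Maximum SGA Size": 949989376,
    "Startup overhead in Shared Pool": 75497472,
    "Free SGA Memory Available": 0,
}


def differing_components(a, b):
    """Map each component name to its (a, b) values where the two disagree."""
    return {name: (a[name], b[name]) for name in a if a[name] != b[name]}


diffs = differing_components(first_listing, second_listing)
same = [name for name in first_listing if name not in diffs]

print("Identical:", ", ".join(same))
for name, (x, y) in diffs.items():
    print(f"{name}: {x} vs {y}")
```

The two regions I care about for OLTP, the buffer cache and the shared pool, come out identical, while the large pool, Java pool, redo buffers and granule size differ.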
The server was an HP DL380 with 4 processor cores and the storage was an HP EFS Clustered Gateway NAS. Before each test I did the following:
- Restore Database
- Reboot Server
- Mount NFS filesystems
- Boot Oracle
Before the MMM case I set vm.nr_hugepages=600 and after the database was booted, hugepages utilization looked like this:
$ grep Huge /proc/meminfo
HugePages_Total:   600
HugePages_Free:    145
Hugepagesize:     2048 kB
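To turn that output into a usage figure, here is a small sketch that parses the three lines and works out how many pages, and how much memory, the instance took from the pool. The function name hugepages_in_use is hypothetical and the text is pasted in as a string rather than read from /proc so the example is self-contained.

```python
MEMINFO = """\
HugePages_Total:   600
HugePages_Free:    145
Hugepagesize:     2048 kB
"""


def hugepages_in_use(meminfo_text):
    """Return (pages in use, KB in use) from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if rest.strip():
            # The first token after the colon is the number; a unit may follow.
            fields[name.strip()] = int(rest.split()[0])
    used_pages = fields["HugePages_Total"] - fields["HugePages_Free"]
    return used_pages, used_pages * fields["Hugepagesize"]


used_pages, used_kb = hugepages_in_use(MEMINFO)
print(f"{used_pages} hugepages in use, {used_kb // 1024} MB")
```

That works out to 455 pages, or 910 MB, which lines up with the roughly 906 MB Maximum SGA Size (949989376 bytes) reported for the manually tuned instance.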
So, given all these conditions, I believe I am making an apples-to-apples comparison of AMM to MMM where AMM does not get hugepages support but MMM does. I think this is a pretty stressful workload, given the size of the server, since I am maxing out the processors and performing a significant amount of I/O.
OK, so this is a very contained case and Oracle Database 11g is still only available on x86 Linux. I hope I can have the time to do a similar test with more substantial gear. For the time being, what I know is that losing hugepages support for the sake of gaining AMM should not make you lose sleep. The results measured in throughput (transactions per second) and server statistics are in:
| Configuration | OLTP Transactions/sec | Logical IO/sec | Block Changes/sec | Physical Read/sec | Physical Write/sec |
Looks like 4% in favor of AMM to me, and that is likely attributable to the 13% more physical I/O per transaction the MMM case had to perform. That part of the results has me baffled for the moment since they both have the same buffering, as the v$sgainfo output above shows. Well, yes, there is a significant difference in the amount of Large Pool in the MMM case, but this workload really shouldn’t have any demand on Large Pool. I’m going to investigate that further. Perhaps an interesting test would be to reduce the amount of buffering the AMM case gets to force more physical I/O. That could bring it more in line. We’ll see.
I’m not saying hugepages is no help across the board. What I am saying is that I would weigh heavily the benefits AMM offers because losing hugepages might not make any difference for you at all. If it is, in fact, a huge problem across the board then it looks like there has been work done in this area for the 2.6 Kernel and it seems reasonable that such a feature (hugepages support for mmap(2)) could be implemented. We’ll see.