Recently I had someone ask me in email why I bother posting installments on my Little Things Doth Crabby Make series. I responded by saying I think it is valuable to IT professionals to know they are not alone when confronted by something that makes little sense, or makes them crabby if that be the case. It’s all about the Wayward Googler(tm).
Well, Wayward Googler, it’s coming on thick.
Using Memory and Then Allocating HugePages (Or Die Trying)
I purposefully booted my system with no hugepages allocated in /etc/sysctl.conf (vm.nr_hugepages = 0). I then booted an Oracle Database 11g instance with sga_target set to 8000M. Next, I fired off 500 dedicated connections using the following goofy stuff:
$ cat doit
cnt=0
until [ $cnt -eq 500 ]
do
sqlplus rw/rw @foo.sql &
(( cnt = $cnt + 1 ))
done
wait
$ cat foo.sql
HOST sleep 120
exit;
The script ran in a matter of moments since I’m using a Xeon 5500 (Nehalem) based dual-socket server running Linux with a 2.6 kernel. Yes, these processors are really, really fast. But that, of course, isn’t what made me crabby.
Directly before I invoked the script that fired off my 500 dedicated connections, I executed a script that intermittently peeked at how much memory was being wasted on page tables. Remember, without hugepages (hugetlb) backed IPC Shared Memory for the SGA there will be page table overhead for every connection to the instance. The size of the SGA and the number of dedicated connections compound to consume potentially significant amounts of memory. Although that is also not what made me crabby, let’s look at what 500 dedicated sessions attaching to an 8000 MB SGA looks like as the user count ramps up:
$ while true
> do
> grep PageTables /proc/meminfo
> sleep 10
> done
PageTables:      3764 kB
PageTables:      4696 kB
PageTables:     65848 kB
PageTables:    176956 kB
PageTables:    287616 kB
PageTables:    366540 kB
PageTables:    478224 kB
PageTables:    588424 kB
PageTables:    699832 kB
PageTables:    792356 kB
PageTables:    802468 kB
PageTables:    834004 kB
PageTables:    851980 kB
PageTables:    835432 kB
PageTables:    834948 kB
PageTables:    835052 kB
PageTables:   1463260 kB
PageTables:   2072864 kB
PageTables:   2679572 kB
PageTables:   3283456 kB
PageTables:   3892628 kB
PageTables:   4496868 kB
PageTables:   5100908 kB
PageTables:   6846256 kB
PageTables:   6866820 kB
PageTables:   6829388 kB
PageTables:   6874752 kB
PageTables:   6879360 kB
PageTables:   6883076 kB
PageTables:   6895244 kB
PageTables:   6901528 kB
PageTables:   6917256 kB
PageTables:   6927984 kB
PageTables:   6999196 kB
PageTables:   6999472 kB
PageTables:   7000048 kB
PageTables:   7088160 kB
PageTables:   7087960 kB
PageTables:   7088812 kB
PageTables:   7132804 kB
PageTables:   7121120 kB
Got Spare Memory? Good, Don’t Use Hugepages
Uh, just short of 7 GB of physical memory lost to page tables! That’s ugly, but that’s not what made me crabby. Before I forget, did I mention that it is a really good idea to back your SGA with hugepages if you are running a lot of dedicated connections and have a large SGA?
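That number is roughly what simple arithmetic predicts. With 4KB pages, each page table entry costs 8 bytes, so if we assume (a back-of-envelope assumption only) that each of the 500 dedicated connections eventually maps most of the 8000MB SGA, the per-process and total page table costs in KB work out like this:

$ echo $(( ( 8000 * 1024 / 4 ) * 8 / 1024 ))
16000
$ echo $(( ( 8000 * 1024 / 4 ) * 8 / 1024 * 500 ))
8000000

That is about 16 MB of page tables per process, or roughly 8 GB across 500 connections, which is right in line with the nearly 7 GB measured above (not every process had touched every SGA page).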
So, What Did Make Him Crabby Anyway?
Wasting all that physical memory with page tables was just part of some analysis I’m doing. I never aim to waste memory (nor processor cycles for TLB misses) like that. So, I shut my Oracle Database 11g instance down in order to implement hugepages and move on. This is where I started getting crabby.
The first thing I did was verify there were, in fact, no allocated hugepages. Next, I checked to see if I had enough free memory to mess with. In this case I had most of the 16GB physical memory free. So, I tried to allocate 6200 2MB hugepages by echoing the token into /proc. Finally, I checked to make sure I was granted the hugepages I requested…Irk. Now that made me crabby. Instead of 6200 I was given what appears to be some random number someone pulled out of the clothes hamper—604 hugepages:
# grep HugePages /proc/meminfo
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
# free
             total       used       free     shared    buffers     cached
Mem:      16427876     422408   16005468          0      24104     209060
-/+ buffers/cache:     189244   16238632
Swap:      2097016      29836    2067180
# echo 6200 > /proc/sys/vm/nr_hugepages
# grep HugePages /proc/meminfo
HugePages_Total:   604
HugePages_Free:    604
HugePages_Rsvd:      0
So, I then checked to see what free memory looked like:
# free
             total       used       free     shared    buffers     cached
Mem:      16427876    1670400   14757476          0      27040     207924
-/+ buffers/cache:    1435436   14992440
Swap:      2097016      29696    2067320
Clearly I was granted that oddball 604 hugepages I didn’t ask for. Maybe I’m supposed to just take what I’m given and be happy?
Please Sir, May I Have Some More?
I thought, perhaps the system just didn’t hear me clearly. So, without changing anything I just belligerently repeated my command and found that doing so increased my allocated hugepages by a whopping 2:
# echo 6200 > /proc/sys/vm/nr_hugepages
# grep HugePages /proc/meminfo
HugePages_Total:   608
HugePages_Free:    608
HugePages_Rsvd:      0
I began to wonder if there was some reason 6200 was throwing the system a curve-ball. Here’s what happened when I lowered my expectations by requesting 3100:
# echo 3100 > /proc/sys/vm/nr_hugepages; grep HugePages /proc/meminfo
HugePages_Total:   610
HugePages_Free:    610
HugePages_Rsvd:      0
Great. I began to wonder how long I could continually whack my head against the wall picking up little bits and pieces of hugepages along the way. So, I scripted 1000 consecutive requests for hugepages. I thought, perhaps, it was necessary to really, really want those hugepages:
# cnt=0; until [ $cnt -eq 1000 ]
> do
> echo 6200 > /proc/sys/vm/nr_hugepages
> (( cnt = $cnt + 1 ))
> done
# grep HugePages /proc/meminfo
HugePages_Total:  5502
HugePages_Free:   5502
HugePages_Rsvd:      0
Brilliant! Somewhere along the way the system decided to start doling out more than those piddly 2-page allocations in response to my request for 6200, otherwise I would have exited this loop with 2,610 hugepages. Instead, I exited the loop with 5502.
Well, since some is good, more must be better. I decided to run that stupid loop again just to see if I could pick up any more crumbs:
# cnt=0; until [ $cnt -eq 1000 ]; do echo 6200 > /proc/sys/vm/nr_hugepages; (( cnt = $cnt + 1 )); done
# grep PageTables /proc/meminfo
PageTables:       7472 kB
# grep '^Hu' /proc/meminfo
HugePages_Total:  5742
HugePages_Free:   5742
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
That makes me crabby.
Summary:
We should all do ourselves a favor and make sure we boot our servers with sufficient hugepages to cover our SGA(s). And, of course, you don’t get hugepages if you use Automatic Memory Management.
🙂
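For the record, the boot-time setup I’m advocating is just the usual pair of settings. Here is a sketch sized for the 8000MB SGA used in this post (the numbers are illustrative; round the hugepage count up a bit for granule overhead and size memlock, in KB, accordingly):

# /etc/sysctl.conf -- enough 2MB hugepages to cover an 8000MB SGA with headroom
vm.nr_hugepages = 4100

# /etc/security/limits.conf -- allow the oracle user to lock that much memory (KB)
oracle soft memlock 8400000
oracle hard memlock 8400000

Then reboot (or at least set vm.nr_hugepages before memory gets fragmented) and confirm with grep HugePages /proc/meminfo that the full pool was granted before starting the instance.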
It’s hearing things like this that makes me happy to be using AIX: largepage support there is child’s play compared to this iterative process. See my latest post right on this subject.
Mind you: I thought “largepages” was for pages up to around 16MB, then hugepages for things like >1GB page size? Is that the case as well with Linux?
It’s amazing how much difference to performance an apparently small detail – such as the number of pages of virtual memory – can make.
Mind you: we’re taming a thing called TM1 from Cognos that absolutely refuses to run efficiently in *n*x, no matter how big we make the pages.
It uses “only” 128GB of memory – one of those “everything in memory”, “efficient” databases…
Runs like a charm – well, within limits! – in Windoze, but it just refuses to behave in AIX. It is a disaster in Solaris. Very little doco on it anywhere…
Hey: that Power6 stuff from IBM rocks: runs 10g with amazing speed. 6GHz cpu and counting. Love it!
It’s always great when you stop by, Noons… ‘sbeen a while.
I couldn’t come up with anything bad to say about Power if I tried.
Wow, I’d sure love to see AWR reports from that cognos workload…sounds like a doozie. Can you email me some?
As for x86_64 Linux, the largest page supported is 2MB. I recall 4MB page support way back in the PentiumPro days. IA64 has really large page support (I believe 256MB but I’d have to look that up right quick)… I think these are all pretty small, but any form of large pages is crucial with Oracle for the TLB relief.
Sorry haven’t been around so much. Recession and all that, we’re flat out trying to justify our salaries…
This TM1 stuff is almost Oracle agnostic. Well, it reads off our DW and then builds the BI cubes from that, so I can’t really say it is totally agnostic. But nearly.
Its load phase rarely even registers as a blip in our DW. Around 1.5 hours startup time on the Wintel box, with the AIX db mostly in sqlnet client wait – Oracle is not the bottleneck.
It sucks all its data into memory, then builds daily cubes for general BI work, again in-memory.
Supposedly it is very efficient. I have yet to see one of these “all-db-in-memory” applications that deserves the adjective “efficient”, but I’m happy to stand corrected. Whenever…
Still: a RPITA. It’s running into all sorts of limits of our Wintel boxes with that huge 128GB-and-growing memory attitude and it runs 10 times slower in anything else!
I’m hoping now that IBM owns Cognos, they’ll fix this monster. And make it behave in anything non-Wintel. But one never knows what not-so-big-anymore-blue is up to, next…
Anyways: if we find a way to make this thing behave I’ll definitely make a post on it.
Looks like this loop was freeing the page tables for the “huged” pages, and therefore on each try you had more room to create new huge pages, right? You did not show the page table size after you terminated Oracle; what does it look like?
Bernd
Bernd,
Let me fold your thoughts into the test and let’s see what happens. Good points.
Hey Kevin,
That’s some awfully strange behavior…I wonder what the root cause of that peculiar memory allocation strategy is.
Just a few tidbits about HW support for large pages (which are a great tool for improving performance on many memory intensive workloads). First of all, CPU designers have definitely taken notice…but I think in many cases it takes the SW guys a bit longer to catch up.
1. Itanium supported 256MB pages, Itanium 2 supports 4GB pages.
2. The POWER6 has 8KB, 64KB and 16MB pages, which can all share TLB entries; supposedly there are also 16GB pages, but I haven’t really seen much about the latter and I don’t think they can use the same TLB entries as the other pages.
Also, it’s good to keep in mind that PPC has a three level address translation (real->effective->virtual), while other architectures have two levels (real->virtual). So missing in the ERAT (IBM’s equivalent of a TLB) forces a lookup into two address translation tables.
PPC and IPF both have peculiar page table formats (IPF has the virtually hashed PT and IBM has the hashed PT).
3. All x86’s support 4KB, 2MB and 4MB pages. The TLBs use two entries for a 4MB page though, so using 4MB pages doesn’t increase the amount of memory that the TLBs can cover.
4. AMD’s Barcelona and Shanghai support 1GB pages in their TLB (for data only). Most software does not take advantage of this – I doubt linux and windows support it. Barcelona and Shanghai generally have more large TLB entries than Intel CPUs.
5. Nehalem does not support 1GB pages (probably one reason that Windows and Linux don’t either) and generally doesn’t have quite as impressive support for large pages as AMD. Nehalem’s TLBs can hold 32 large (2MB) pages.
6. I don’t recall much about TLBs for SPARC processors. However, SPARC supports 8KB, 64KB, 512KB and 4MB page sizes. Possibly larger pages have been added more recently.
David
Hi David,
Thanks for stopping by. Yes, AMD has traditionally done larger large pages than x86(64) Intel… Interesting to see Barcelona 1GB pages… I can’t imagine Linux or Windows coding to that though… we’ll see… and, yes, CMT SPARC supports very, very large pages…I’d have to look up just how very large that is…unless my old friend Glenn Fawcett can chime in as he commits that stuff to memory.
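(As an aside for anyone following along on Linux: the processor itself advertises its large page capabilities in /proc/cpuinfo. The pse flag covers the classic large pages and pdpe1gb indicates 1GB page support. A quick, illustrative check:)

$ grep flags /proc/cpuinfo | head -1 | tr ' ' '\n' | egrep -x 'pse|pdpe1gb'

On Nehalem you would expect only pse to show up, which squares with David’s point about the missing 1GB page support.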
My impression is that you should configure the number of huge pages that you want and then reboot … don’t they stay reserved that way?
Someone ( Werner Puschitz I think ) has a guide to configuring memory for oracle on Red Hat Enterprise ( aka OEL ) … did you follow the steps recommended in that guide ( which if I remember correctly did call for a reboot )?
Hi John,
Agree with you…in fact, I’ll quote my summary in the blog post you are commenting on:
🙂
Have you tried clearing the system caches before trying to allocate hugepages:
echo 3 >/proc/sys/vm/drop_caches
echo 6200 > /proc/sys/vm/nr_hugepages
Hello Radu,
No I didn’t think to try that. Mostly because the free command showed me over 97% free memory. I should think there’d be nothing to flush…
# free
             total       used       free     shared    buffers     cached
Mem:      16427876     422408   16005468          0      24104     209060
-/+ buffers/cache:     189244   16238632
Swap:      2097016      29836    2067180
Hi,
echo 3 > /proc/sys/vm/drop_caches
will free pagecache, dentries and inodes. I’ve also seen advice to run sync first in order to make sure all objects are free. This will bring the system as close to the state after boot as possible.
If I understand correctly, huge pages need contiguous blocks of memory. Memory allocated to filesystem cache is in fact used by the kernel, but it will be released as applications need it. On my system dropping the caches will allow me to allocate more memory to huge pages and the allocation will be faster. Trying to allocate huge pages with a large filesystem cache takes forever on my system and involves a lot of I/O to disk.
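Putting the two suggestions together, the whole sequence would look something like this (just a sketch; whether it changes the outcome on Kevin’s box is exactly the open question):

sync
echo 3 > /proc/sys/vm/drop_caches
echo 6200 > /proc/sys/vm/nr_hugepages
grep HugePages /proc/meminfo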
Interesting, Radu. I have to say that I don’t much like that drop_caches thing. I’ve trolled the web for it briefly (it is pretty new) and there are a lot of folks saying to sync first just as you’ve said here. Personally, I think that if the token “3” magically plops into memory via /proc/sys/vm/drop_caches the kernel should kindly sync for me. Seems a little safer. Well, I guess the moniker is “drop.” However, if the name was “destroy” I’d be a little less surprised at the concerns over syncing. It just seems weird, and in the end still doesn’t make me less crabby. The way I see it, if I echo 6200 into hugepages I should either get 6200 or none at all…none of this piecemeal granting of a page or two here and there as my loop demonstrated.
Metalink has just come out with Note 401749.1 to autocalc recommended hugepages number
fyi
I don’t think it drops dirty blocks, so not syncing is safe; it just won’t clean up all possible caches.
Please check Metalink Note 361323.1. One lesson I learned while setting hugepages is to make sure your database is up and running before you run the script, hugepages_settings.sh.
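For anyone who cannot pull up that note, the idea behind hugepages_settings.sh is simple enough to sketch. This is not the MOS script itself, just a rough equivalent of the approach: with the instance(s) up, sum the SysV shared memory segments and divide by the hugepage size.

# Rough equivalent of the hugepages_settings.sh idea -- not the actual MOS script.
# Run with the instance(s) up so their shared memory segments are visible to ipcs.
HPG_SZ=$(grep Hugepagesize /proc/meminfo | awk '{print $2}')     # hugepage size in KB
SHM_TOTAL=$(ipcs -m | awk '/^0x/ {sum += $5} END {print sum}')   # total SysV shm in bytes
echo "Suggested vm.nr_hugepages: $(( SHM_TOTAL / 1024 / HPG_SZ + 1 ))"

The +1 just rounds up; compare the result with what you are actually granted before relying on it.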