Recently I had someone ask me in email why I bother posting installments on my Little Things Doth Crabby Make series. I responded by saying I think it is valuable to IT professionals to know they are not alone when confronted by something that makes little sense, or makes them crabby if that be the case. It’s all about the Wayward Googler(tm).
Well, Wayward Googler, it’s coming on thick.
Using Memory and Then Allocating HugePages (Or Die Trying)
I purposefully booted my system with no hugepages allocated in /etc/sysctl.conf (vm.nr_hugepages = 0). I then booted an Oracle Database 11g instance with sga_target set to 8000M. Next, I fired off 500 dedicated connections using the following goofy stuff:
$ cat doit
cnt=0
until [ $cnt -eq 500 ]
do
sqlplus rw/rw @foo.sql &
(( cnt = $cnt + 1 ))
done
wait
$ cat foo.sql
HOST sleep 120
exit;
The script ran in a matter of moments since I’m using a Xeon 5500 (Nehalem) based dual-socket server running Linux with a 2.6 kernel. Yes, these processors are really, really fast. But that, of course, isn’t what made me crabby.
Directly before I invoked the script that fired off my 500 dedicated connections, I executed a script that intermittently peeked at how much memory was being wasted on page tables. Remember, without hugepages (hugetlb) backed IPC Shared Memory for the SGA there will be page table overhead for every connection to the instance. The size of the SGA and the number of dedicated connections compound to consume potentially significant amounts of memory. Although that is also not what made me crabby, let’s look at what 500 dedicated sessions attaching to an 8000 MB SGA looks like as the user count ramps up:
$ while true
> do
> grep PageTables /proc/meminfo
> sleep 10
> done
PageTables:      3764 kB
PageTables:      4696 kB
PageTables:     65848 kB
PageTables:    176956 kB
PageTables:    287616 kB
PageTables:    366540 kB
PageTables:    478224 kB
PageTables:    588424 kB
PageTables:    699832 kB
PageTables:    792356 kB
PageTables:    802468 kB
PageTables:    834004 kB
PageTables:    851980 kB
PageTables:    835432 kB
PageTables:    834948 kB
PageTables:    835052 kB
PageTables:   1463260 kB
PageTables:   2072864 kB
PageTables:   2679572 kB
PageTables:   3283456 kB
PageTables:   3892628 kB
PageTables:   4496868 kB
PageTables:   5100908 kB
PageTables:   6846256 kB
PageTables:   6866820 kB
PageTables:   6829388 kB
PageTables:   6874752 kB
PageTables:   6879360 kB
PageTables:   6883076 kB
PageTables:   6895244 kB
PageTables:   6901528 kB
PageTables:   6917256 kB
PageTables:   6927984 kB
PageTables:   6999196 kB
PageTables:   6999472 kB
PageTables:   7000048 kB
PageTables:   7088160 kB
PageTables:   7087960 kB
PageTables:   7088812 kB
PageTables:   7132804 kB
PageTables:   7121120 kB
Got Spare Memory? Good, Don’t Use Hugepages
Uh, just short of 7 GB of physical memory lost to page tables! That’s ugly, but that’s not what made me crabby. Before I forget, did I mention that it is a really good idea to back your SGA with hugepages if you are running a lot of dedicated connections and have a large SGA?
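That number is roughly what simple arithmetic predicts. With 4KB pages, each page table entry costs 8 bytes, so if we assume (a back-of-envelope assumption only) that each of the 500 dedicated connections eventually maps most of the 8000MB SGA, the per-process and total page table costs in KB work out like this:

$ echo $(( ( 8000 * 1024 / 4 ) * 8 / 1024 ))
16000
$ echo $(( ( 8000 * 1024 / 4 ) * 8 / 1024 * 500 ))
8000000

That is about 16 MB of page tables per process, or roughly 8 GB across 500 connections, which is right in line with the nearly 7 GB measured above (not every process had touched every SGA page).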
So, What Did Make Him Crabby Anyway?
Wasting all that physical memory with page tables was just part of some analysis I’m doing. I never aim to waste memory (nor processor cycles for TLB misses) like that. So, I shut my Oracle Database 11g instance down in order to implement hugepages and move on. This is where I started getting crabby.
The first thing I did was verify there were, in fact, no allocated hugepages. Next, I checked to see if I had enough free memory to mess with. In this case I had most of the 16GB physical memory free. So, I tried to allocate 6200 2MB hugepages by echoing the token into /proc. Finally, I checked to make sure I was granted the hugepages I requested…Irk. Now that made me crabby. Instead of 6200 I was given what appears to be some random number someone pulled out of the clothes hamper—604 hugepages:
# grep HugePages /proc/meminfo
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
# free
             total       used       free     shared    buffers     cached
Mem:      16427876     422408   16005468          0      24104     209060
-/+ buffers/cache:     189244   16238632
Swap:      2097016      29836    2067180
# echo 6200 > /proc/sys/vm/nr_hugepages
# grep HugePages /proc/meminfo
HugePages_Total:   604
HugePages_Free:    604
HugePages_Rsvd:      0
So, I then checked to see what free memory looked like:
# free
             total       used       free     shared    buffers     cached
Mem:      16427876    1670400   14757476          0      27040     207924
-/+ buffers/cache:    1435436   14992440
Swap:      2097016      29696    2067320
Clearly I was granted that oddball 604 hugepages I didn’t ask for. Maybe I’m supposed to just take what I’m given and be happy?
Please Sir, May I Have Some More?
I thought, perhaps the system just didn’t hear me clearly. So, without changing anything I just belligerently repeated my command and found that doing so increased my allocated hugepages by a whopping 2:
# echo 6200 > /proc/sys/vm/nr_hugepages
# grep HugePages /proc/meminfo
HugePages_Total:   608
HugePages_Free:    608
HugePages_Rsvd:      0
I began to wonder if there was some reason 6200 was throwing the system a curve-ball. Here’s what happened when I lowered my expectations by requesting 3100:
# echo 3100 > /proc/sys/vm/nr_hugepages; grep HugePages /proc/meminfo
HugePages_Total:   610
HugePages_Free:    610
HugePages_Rsvd:      0
Great. I began to wonder how long I could continually whack my head against the wall picking up little bits and pieces of hugepages along the way. So, I scripted 1000 consecutive requests for hugepages. I thought, perhaps, it was necessary to really, really want those hugepages:
# cnt=0; until [ $cnt -eq 1000 ]
> do
> echo 6200 > /proc/sys/vm/nr_hugepages
> (( cnt = $cnt + 1 ))
> done
# grep HugePages /proc/meminfo
HugePages_Total:  5502
HugePages_Free:   5502
HugePages_Rsvd:      0
Brilliant! Somewhere along the way the system decided to start doling out more than those piddly 2-page allocations in response to my request for 6200, otherwise I would have exited this loop with 2,610 hugepages. Instead, I exited the loop with 5502.
Well, since some is good, more must be better. I decided to run that stupid loop again just to see if I could pick up any more crumbs:
# cnt=0; until [ $cnt -eq 1000 ]; do echo 6200 > /proc/sys/vm/nr_hugepages; (( cnt = $cnt + 1 )); done
# grep PageTables /proc/meminfo
PageTables:       7472 kB
# grep '^Hu' /proc/meminfo
HugePages_Total:  5742
HugePages_Free:   5742
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
That makes me crabby.
Summary:
We should all do ourselves a favor and make sure we boot our servers with sufficient hugepages to cover our SGA(s). And, of course, you don’t get hugepages if you use Automatic Memory Management.
🙂
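For the record, the boot-time setup I’m advocating is just the usual pair of settings. Here is a sketch sized for the 8000MB SGA used in this post (the numbers are illustrative; round the hugepage count up a bit for granule overhead and size memlock, in KB, accordingly):

# /etc/sysctl.conf -- enough 2MB hugepages to cover an 8000MB SGA with headroom
vm.nr_hugepages = 4100

# /etc/security/limits.conf -- allow the oracle user to lock that much memory (KB)
oracle soft memlock 8400000
oracle hard memlock 8400000

Then reboot (or at least set vm.nr_hugepages before memory gets fragmented) and confirm with grep HugePages /proc/meminfo that the full pool was granted before starting the instance.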
It’s hearing things like this that makes me happy to be using AIX: largepage support there is child’s play compared to this iterative process. See my latest post right on this subject.
Mind you: I thought “largepages” was for pages up to around 16MB, then hugepages for things like >1GB page size? Is that the case as well with Linux?
It’s amazing how much difference to performance an apparently small detail – such as the number of pages of virtual memory – can make.
Mind you: we’re taming a thing called TM1 from Cognos that absolutely refuses to run efficiently in *n*x, no matter how big we make the pages.
It uses “only” 128GB of memory – one of those “everything in memory”, “efficient” databases…
Runs like a charm – well, within limits! – in Windoze, but it just refuses to behave in AIX. It is a disaster in Solaris. Very little doco on it anywhere…
Hey: that Power6 stuff from IBM rocks: runs 10g with amazing speed. 6GHz cpu and counting. Love it!
It’s always great when you stop by, Noons… ‘sbeen a while.
I couldn’t come up with anything bad to say about Power if I tried.
Wow, I’d sure love to see AWR reports from that cognos workload…sounds like a doozie. Can you email me some?
As for x86_64 Linux, the largest page supported is 2MB. I recall 4MB page support way back in the PentiumPro days. IA64 has really large page support (I believe 256MB but I’d have to look that up right quick)… I think these are all pretty small, but any form of large pages is crucial with Oracle for the TLB relief.
Sorry haven’t been around so much. Recession and all that, we’re flat out trying to justify our salaries…
This TM1 stuff is almost Oracle agnostic. Well, it reads off our DW and then builds the BI cubes from that, so I can’t really say it is totally agnostic. But nearly.
Its load phase rarely even registers as a blip in our DW. Around 1.5 hours startup time on the Wintel box, with the AIX db mostly in sqlnet client wait – Oracle is not the bottleneck.
It sucks all its data into memory, then builds daily cubes for general BI work, again in-memory.
Supposedly it is very efficient. I have yet to see one of these “all-db-in-memory” applications that deserves the adjective “efficient”, but I’m happy to stand corrected. Whenever…
Still: a RPITA. It’s running into all sorts of limits of our Wintel boxes with that huge 128GB-and-growing memory attitude and it runs 10 times slower in anything else!
I’m hoping now that IBM owns Cognos, they’ll fix this monster. And make it behave in anything non-Wintel. But one never knows what not-so-big-anymore-blue is up to, next…
Anyways: if we find a way to make this thing behave I’ll definitely make a post on it.
Looks like this loop was freeing the page tables for the “huged” pages, and therefore on each try you had more room to create new huge pages, right? You did not show the page table size after you terminated Oracle; what does it look like?
Bernd
Bernd,
Let me fold your thoughts into the test and let’s see what happens. Good points.
Hey Kevin,
That’s some awfully strange behavior…I wonder what the root cause of that peculiar memory allocation strategy is.
Just a few tidbits about HW support for large pages (which are a great tool for improving performance on many memory intensive workloads). First of all, CPU designers have definitely taken notice…but I think in many cases it takes the SW guys a bit longer to catch up.
1. Itanium supported 256MB pages, Itanium 2 supports 4GB pages.
2. The POWER6 has 8KB, 64KB and 16MB pages, which can all share TLB entries; supposedly there are also 16GB pages, but I haven’t really seen much about the latter and I don’t think they can use the same TLB entries as the other pages.
Also, it’s good to keep in mind that PPC has a three level address translation (real->effective->virtual), while other architectures have two levels (real->virtual). So missing in the ERAT (IBM’s equivalent of a TLB) forces a lookup into two address translation tables.
PPC and IPF both have peculiar page table formats (IPF has the virtually hashed PT and IBM has the hashed PT).
3. All x86’s support 4KB, 2MB and 4MB pages. The TLBs use two entries for a 4MB page though, so using 4MB pages doesn’t increase the amount of memory that the TLBs can cover.
4. AMD’s Barcelona and Shanghai support 1GB pages in their TLB (for data only). Most software does not take advantage of this – I doubt linux and windows support it. Barcelona and Shanghai generally have more large TLB entries than Intel CPUs.
5. Nehalem does not support 1GB pages (probably one reason that Windows and Linux don’t either) and generally doesn’t have quite as impressive support for large pages as AMD. Nehalem’s TLBs can hold 32 large (2MB) pages.
6. I don’t recall much about TLBs for SPARC processors. However, SPARC supports 8KB, 64KB, 512KB and 4MB page sizes. Possibly larger pages have been added more recently.
David
Hi David,
Thanks for stopping by. Yes, AMD has traditionally done larger large pages than x86(64) Intel… Interesting to see Barcelona 1GB pages… I can’t imagine Linux or Windows coding to that though… we’ll see… and, yes, CMT SPARC supports very, very large pages…I’d have to look up just how very large that is…unless my old friend Glenn Fawcett can chime in as he commits that stuff to memory.
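(As an aside for anyone following along on Linux: the processor itself advertises its large page capabilities in /proc/cpuinfo. The pse flag covers the classic large pages and pdpe1gb indicates 1GB page support. A quick, illustrative check:)

$ grep flags /proc/cpuinfo | head -1 | tr ' ' '\n' | egrep -x 'pse|pdpe1gb'

On Nehalem you would expect only pse to show up, which squares with David’s point about the missing 1GB page support.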
My impression is that you should configure the number of huge pages that you want and then reboot … don’t they stay reserved that way?
Someone ( Werner Puschitz I think ) has a guide to configuring memory for oracle on Red Hat Enterprise ( aka OEL ) … did you follow the steps recommended in that guide ( which if I remember correctly did call for a reboot )?
Hi John,
Agree with you…in fact, I’ll quote my summary in the blog post you are commenting on:
🙂
Have you tried clearing the system caches before trying to allocate hugepages:
echo 3 >/proc/sys/vm/drop_caches
echo 6200 > /proc/sys/vm/nr_hugepages
Hello Radu,
No I didn’t think to try that. Mostly because the free command showed me over 97% free memory. I should think there’d be nothing to flush…
# free
             total       used       free     shared    buffers     cached
Mem:      16427876     422408   16005468          0      24104     209060
-/+ buffers/cache:     189244   16238632
Swap:      2097016      29836    2067180
Hi,
echo 3 > /proc/sys/vm/drop_caches
will free pagecache, dentries and inodes. I’ve also seen advice to run sync first in order to make sure all objects are free. This will bring the system as close to the state after boot as possible.
If I understand correctly, huge pages need contiguous blocks of memory. Memory allocated to filesystem cache is in fact used by the kernel, but it will be released as applications need it. On my system dropping the caches will allow me to allocate more memory to huge pages and the allocation will be faster. Trying to allocate huge pages with a large filesystem cache takes forever on my system and involves a lot of I/O to disk.
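Putting the two suggestions together, the whole sequence would look something like this (just a sketch; whether it changes the outcome on Kevin’s box is exactly the open question):

sync
echo 3 > /proc/sys/vm/drop_caches
echo 6200 > /proc/sys/vm/nr_hugepages
grep HugePages /proc/meminfo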
Interesting, Radu. I have to say that I don’t much like that drop_caches thing. I’ve trolled the web for it briefly (it is pretty new) and there are a lot of folks saying to sync first just as you’ve said here. Personally, I think that if the token “3” magically plops into memory via /proc/sys/vm/drop_caches the kernel should kindly sync for me. Seems a little safer. Well, I guess the moniker is “drop.” However, if the name was “destroy” I’d be a little less surprised at the concerns over syncing. It just seems weird, and in the end still doesn’t make me less crabby. The way I see it, if I echo 6200 into hugepages I should either get 6200 or none at all…none of this piecemeal granting of a page or two here and there as my loop demonstrated.
Metalink has just come out with Note 401749.1 to autocalc recommended hugepages number
fyi
I don’t think it drops dirty blocks, so not syncing is safe; it just won’t clean up all possible caches.
Please check Metalink Note 361323.1. One lesson I learned while setting hugepages is to make sure your database is up and running before you run the script, hugepages_settings.sh.
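For anyone who cannot pull up that note, the idea behind hugepages_settings.sh is simple enough to sketch. This is not the MOS script itself, just a rough equivalent of the approach: with the instance(s) up, sum the SysV shared memory segments and divide by the hugepage size.

# Rough equivalent of the hugepages_settings.sh idea -- not the actual MOS script.
# Run with the instance(s) up so their shared memory segments are visible to ipcs.
HPG_SZ=$(grep Hugepagesize /proc/meminfo | awk '{print $2}')     # hugepage size in KB
SHM_TOTAL=$(ipcs -m | awk '/^0x/ {sum += $5} END {print sum}')   # total SysV shm in bytes
echo "Suggested vm.nr_hugepages: $(( SHM_TOTAL / 1024 / HPG_SZ + 1 ))"

The +1 just rounds up; compare the result with what you are actually granted before relying on it.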