
One For Your Calendar

Larry Ellison to Announce Sun Oracle Database Machine Specialized for OLTP.

World-Record TPC-H Result Proves Oracle Exadata Storage Server is 10x Faster Than Conventional Storage on a Per-Disk Basis?

BLOG UPDATE (02-Feb-2010): This post has confused some readers. I mention in this post that Exadata Storage Server does not cache data. Please remember that the topic of this post is an audited TPC-H result that used Version 1 Exadata Storage Server cells. Version 2 Exadata Storage Server is the first release that caches data (the read-only Exadata Smart Flash Cache).

I’d like to tackle a couple of the questions that have come at me from blog readers about this benchmark:

Kevin, I saw the 1TB TPCH benchmark number. It is very huge. You say Exadata does not cache data so how can it get such result?

True, I do say Exadata does not cache data. It doesn’t. Well, there is a 0.5 GB write cache in each cell, but that has nothing to do with this benchmark. This was an in-memory Parallel Query benchmark result. The SGA was used to cache the tables and indexes. That doesn’t mean there was no physical I/O (e.g., sort spilling), but the audited runs were not a proof point for scanning tables or indexes with offload processing.

Under The Covers
There were 6 HP Oracle Exadata Storage Servers (cells) in the configuration. Regular readers therefore know that there is no more than 6 GB/s of up-wind bandwidth regardless of whether or not the data is cached in the cells. The database grid in this benchmark had 512 Xeon 5400 processor cores. I assure you all that 6 GB/s cannot properly feed 512 of such processor cores since that is only about 12 MB/s per core.

Let me just point out that this result is with Oracle Database 11g Release 2 on a 64-node database grid with an aggregate memory capacity of roughly 2TB. The email continued:

I guess this prove Oracle with Exadata is 10x faster?

I presume the reader was referring to the prior Oracle Database 11g 1TB TPC-H result with conventional storage. Folks, Exadata can be 10x faster than Oracle on state-of-the-art conventional storage (generally misconfigured, poorly provisioned, etc). No argument here. But, honestly, I can’t sit here and tell you that 6 Exadata cells with 72 disks are 10x faster than the 768 15K RPM drives connected via 128 4Gb Fibre Channel ports used in the prior Oracle 1TB result, since that configuration has roughly 50 GB/s of theoretical I/O bandwidth. If you investigate that prior Oracle Database 11g 1TB TPC-H result you’ll see that it was configured with less than 20% of the RAM used by the new Oracle Database 11g Release 2 result (384 GB aggregate versus 2080 GB).

So, what’s my point?

This new world record is a testimonial to the scalability of Real Application Clusters for concurrent, warehouse-style queries. As much as I’d love to lay claim to the victory on behalf of Exadata, I have to point out, in fairness, that in spite of Exadata playing a role in this benchmark, the result cannot be attributed to its I/O capability.

In short, there is no magic in Exadata that makes 6 12-disk storage cells (72 drives) more I/O capable than 768 drives attached via 128 dual-port 4GFC HBAs.

I’m just comparing one Oracle Database 11g result to another Oracle Database 11g result to answer some blog readers’ questions.

So, no, Exadata is not 10x faster on a per-disk basis. Data comes off of round-brown spinning thingies at the same rate whether it flows to Oracle via Exadata or via Fibre Channel. The common problem with conventional storage is the plumbing. Balancing the producer-consumer relationship between storage and an Oracle Database grid with conventional storage, even at the rate produced by a measly 6 Exadata Storage Server cells, can be a difficult task. Consider, for example, that one would require a minimum of 15 active 4GFC host bus adapters to deal with 6 GB/s (a 4GFC port delivers roughly 400 MB/s of usable bandwidth). Grid plumbing requires redundancy, so one would require an additional 15 4GFC paths through different ports and a different switch in order to architect around single points of failure. I’ve lived prior lives rife with FC SAN headaches and I can attest that working out 30 FC paths is no small task.

Using Linux /proc To Identify ORACLE_HOME and Instance Trace Directories.

I recently had a co-worker access one of my systems running Oracle Database 11g. He needed to poke around with focus on an area that he specialized in. After getting him access to the server I got an email from him asking where the trace files are for the instance he was investigating.

This is one of those Carry On Wayward Googler™ sort of posts. Most of you will know this, but it may help someone someday. It did help my co-worker as this was the way I answered his question.

You can find out a lot about an instance without even knowing which ORACLE_HOME it is executing out of by spelunking about in /proc. In the following text box you’ll see how to find the ORACLE_HOME and trace directories for an instance by looking at /proc/<PID>/fd and /proc/<PID>/exe of the LGWR process. This box had an instance called test and an ASM instance. So in this case the ORACLE_HOME values were /u01/app/oracle/product/11.2.0/dbhome_1 and /u01/app/11.2.0/grid.


$ ps -ef | grep lgwr | grep -v grep
oracle    3548     1  0 Sep02 ?        00:00:27 ora_lgwr_test3
oracle    8734     1  0 Sep02 ?        00:00:00 asm_lgwr_+ASM3
$
$ ls -l /proc/8734/exe /proc/3548/exe
lrwxrwxrwx 1 oracle oinstall 0 Sep  2 21:09 /proc/3548/exe -> /u01/app/oracle/product/11.2.0/dbhome_1/bin/oracle
lrwxrwxrwx 1 oracle oinstall 0 Sep  2 11:09 /proc/8734/exe -> /u01/app/11.2.0/grid/bin/oracle
$
$ ls -l /proc/8734/fd  /proc/3548/fd | grep trace | grep -v grep
l-wx------ 1 oracle oinstall 64 Sep  2 21:09 11 -> /u01/app/oracle/diag/rdbms/test/test3/trace/test3_ora_3501.trc
l-wx------ 1 oracle oinstall 64 Sep  2 21:09 12 -> /u01/app/oracle/diag/rdbms/test/test3/trace/test3_ora_3501.trm
l-wx------ 1 oracle oinstall 64 Sep  2 11:09 16 -> /u01/app/oracle/diag/asm/+asm/+ASM3/trace/+ASM3_ora_8636.trc
l-wx------ 1 oracle oinstall 64 Sep  2 11:09 17 -> /u01/app/oracle/diag/asm/+asm/+ASM3/trace/+ASM3_ora_8636.trm
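
If you ever need the same trick from inside a program rather than a shell, the /proc/<PID>/exe symlink can be read directly with readlink(2). Here is a small, hypothetical sketch (the PID arrives as a command-line argument); there is nothing Oracle-specific about it:

/* Hypothetical: resolve /proc/<pid>/exe to find the binary (and therefore the
 * ORACLE_HOME) a background process such as LGWR is running out of.
 */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char link_path[64], target[4096];
    ssize_t len;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    snprintf(link_path, sizeof(link_path), "/proc/%s/exe", argv[1]);
    len = readlink(link_path, target, sizeof(target) - 1);
    if (len < 0) {
        perror("readlink");
        return 1;
    }
    target[len] = '\0';                     /* readlink does not NUL-terminate */
    printf("%s -> %s\n", link_path, target);
    return 0;
}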

Intel Xeon 5500 Nehalem: Is It 17 Percent Or 2.75-Fold Faster Than Xeon 5400 Harpertown? Well, Yes Of Course It Is!

I received two related emails while I was out recently for a couple of days of fishing and hiking. I thought they’d make for an interesting blog entry. The first email read:

…our tests show very little performance improvement on nehalem cpus compared to older Xeon…

And, the other email was the polar opposite:

…in most of our tests the Xeon 5500 was over 2 times as fast as the harpertown Xeon…

And the email continued:

…so we think you should stop saying that Xeon 5500 is double the perf of older xeon

Well, I can’t make everyone happy. I tend to say that Intel Xeon 5500 (Nehalem) processors are twice as fast as Harpertown Xeon (5400) as a conservative, well-rounded way to set expectations.

Introducing Fat and Skinny
OK, bear with me now, this is a wee bit tongue-in-cheek. The reader who emailed me with the report of near parity between Nehalem and Harpertown is not lying, he’s just skinny. And the reader who admonished me for my usual low-ball citation of 2x performance vis-à-vis Nehalem versus Harpertown? No, he’s not lying either…he’s fat. Allow me to explain.

It’s really quite simple. If you run code that spends a significant portion of processor cycles operating on memory lines in the processor cache, you are operating code that has a very low CPI (cycles per instruction) cost. In my terminology such code is “skinny.” On the other hand code that jumps around in memory causing processor stalls for memory loads has a high CPI and is, in my terminology, fat.

Skinny code more or less relegates the comparison between Harpertown and Nehalem to one of clock frequency, whereas fat code is where the rubber really hits the road. The more load- and store-hungry (fat) the code is, the greater the Nehalem pay-off will be.

Let’s take a look at two different, simple programs to help make the point. Using fat.c and skinny.c I’ll take timings on Harpertown- and Nehalem-based boxes. As you can see, skinny.c simply hammers away on the same variable and never leaves L2 cache. On the other hand, fat.c treats its memory allocation as an array of 8-byte longs and skips to every 8th one in a loop in order to force memory loads, since the cache line size on these boxes is 64 bytes. NOTE: do not compile these with -O (or, if you do, change the longs in the array to volatile long). A simple gcc invocation without arguments will suffice.

So, skinny.c has a very low CPI and fat.c has a very high CPI.
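
The original skinny.c and fat.c aren’t reproduced here, but a minimal sketch along the lines described above might look like the following. This is hypothetical illustration code; the array size and iteration counts are arbitrary and are not the values behind the timings below. Per the note above, compile it with a plain gcc and no -O.

/* Hypothetical sketch in the spirit of skinny.c and fat.c as described above.
 * Build: gcc cpi_sketch.c -o cpi_sketch     (no -O, so the loops survive)
 * Run:   ./cpi_sketch skinny     or     ./cpi_sketch fat
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ARR_BYTES (512UL * 1024 * 1024)  /* big enough to defeat the caches  */
#define STRIDE    8                      /* 8 longs x 8 bytes = 64-byte line */

int main(int argc, char **argv)
{
    unsigned long i, sum = 0;
    int j;

    if (argc > 1 && argv[1][0] == 's') {
        /* "skinny": hammer the same variable; it never leaves cache, so the
         * cycles-per-instruction cost is very low */
        for (i = 0; i < 4000000000UL; i++)
            sum += i;
    } else {
        /* "fat": touch every 8th long so each read drags in a fresh 64-byte
         * cache line, forcing memory loads and a high CPI */
        unsigned long *arr = malloc(ARR_BYTES);
        unsigned long nlongs = ARR_BYTES / sizeof(unsigned long);
        if (arr == NULL)
            return 1;
        memset(arr, 1, ARR_BYTES);       /* fault in real, distinct pages */
        for (j = 0; j < 200; j++)
            for (i = 0; i < nlongs; i += STRIDE)
                sum += arr[i];
        free(arr);
    }
    printf("%lu\n", sum);                /* keep sum live */
    return 0;
}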

In the following examples, the model name field from the /proc/cpuinfo output tells us what each system is. The E5430 is Harpertown Xeon and the X5570 is, of course, Nehalem. In terms of clock frequency, the Nehalem processors are 10% faster than the Harpertown Xeons.

In the following box you’ll see screen-scrapes I took from two different systems, one based on Nehalem and the other on Harpertown. Notice how the skinny run time improves by only 17% with the same executable on Nehalem compared to Harpertown.


# cat /proc/cpuinfo | grep 'model name'
model name      : Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
model name      : Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
model name      : Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
model name      : Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
model name      : Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
model name      : Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
model name      : Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
model name      : Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
# md5sum skinny
df86d9a278ea33b7da853d7a17afdd46  skinny

# time ./skinny

real    6m3.658s
user    6m3.567s
sys     0m0.001s
#

# cat /proc/cpuinfo | grep 'model name'
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
# md5sum skinny
df86d9a278ea33b7da853d7a17afdd46  skinny
# time ./skinny

real    5m1.941s
user    5m2.043s
sys     0m0.001s

In the next box you’ll see screen-scrapes from the same two systems where I ran the “fat” executable. Notice how the Harpertown Xeon took 2.75x as long to process the fat.


# cat /proc/cpuinfo | grep 'model name' | head -1
model name      : Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
# md5sum fat
b717640846839413c87aedd708e8ac0d  fat
# time ./fat

real    1m57.731s
user    1m57.659s
sys     0m0.045s

# cat /proc/cpuinfo | grep 'model name' | head -1
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
# md5sum fat
b717640846839413c87aedd708e8ac0d  fat
# time ./fat

real    0m42.834s
user    0m42.803s
sys     0m0.023s

So, as it turns out, we can believe both of the folks that sent me email on the matter.

Oracle Switches To Columnar Store Technology With Oracle Database 11g Release 2?

I see fellow OakTable Network member Tanel Poder has blogged that Oracle Database 11g Release 2 has switched to offer columnar store technology. Or at least one could infer that from the post.

I left a comment on Tanel’s blog but would like to make a quick entry here on the topic as well. Oracle Database 11g Release 2 does not offer column-store technology as thought of in the Vertica (for example) sense. The technology, available only with Exadata storage, is called Hybrid Columnar Compression. The word hybrid is important.

Rows are still used. They are stored in an object called a Compression Unit. Compression Units can span multiple blocks. Like column values are stored together within the Compression Unit, with metadata that maps back to the rows.

So, “hybrid” is the word. But, none of that matters as much as the effectiveness. This form of compression is extremely effective.

Less Blogging? A Mixed Blessing.

I have been down in the foxhole (lab work) non-stop and haven’t come up for air except to take a few short days off. Some of you may enjoy photos of one of the desert sunsets I took in after a full day of fishing… ah, sighs of relief!

It was a true pleasure watching this sunset go from ice to fire. It’s hard to beat desert sunsets…

(Photos: IMG_3245, IMG_3252, IMG_3257, IMG_3261)

Intel Xeon 5500 (Nehalem EP) NUMA Versus Interleaved Memory (aka SUMA): There Is No Difference! A Forced Confession.

I received an interesting email recently from a reader that takes offense at how I dare to discuss the differences between Intel Xeon 5500 (Nehalem) systems operating in NUMA versus SUMA/SUMO mode. One excerpt of the email read:

…and I think you are just creating confusion and chaos to gain popularity with your NUMA versus non-NUMA stuff. We tested everything we can think of and see no difference when booted with NUMA or non-NUMA…

I don’t doubt for one moment that the testing performed by this reader showed no performance differences between NUMA and SUMA, because I have no idea whatsoever what his testing consisted of. And, besides, Xeon 5500 Nehalem EP is one extremely nice NUMA package. That is, when running non-NUMA-aware software on this particular NUMA offering you can rest assured that you won’t likely fall over dead from NUMA pathologies. That’s good, but does that mean there really is no difference when booted in NUMA versus SUMA mode? Hardly!

Please allow me to explain something. Intel Xeon 5500 (Nehalem) is a very tightly coupled NUMA system. Remote memory references are only about 20% more costly than local. If you measure a workload that does not saturate the processors you are very unlikely to detect any difference in throughput. If you have a program that only drives a processor core to, say, 80% utilization you will most likely not see any throughput difference whether the process performs all its I/O into remote memory or local memory. When using only remote memory the process consumes moderately more processor cycles; however, unless the code is overly synthetic, forcing a high rate of L2 misses, the result would likely be equivalent throughput in both the local and remote cases.
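
Before moving on to something more real-world, if you want to put your own number on the local-versus-remote gap for a given box, a quick probe along the following lines will do it. This is only a rough, hypothetical sketch (arbitrary buffer size and repeat count; link with -lnuma, and with -lrt on older glibc), not a rigorous memory benchmark:

/* Rough probe: time block copies from a node-local source buffer and a
 * remote-node source buffer while the process is pinned to node 0.
 * Build: gcc -O2 numa_copy_probe.c -lnuma
 */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <numa.h>

#define BUF_BYTES (1UL << 30)   /* 1 GB per buffer, arbitrary */
#define REPS      10

static double copy_seconds(char *dst, char *src)
{
    struct timespec t0, t1;
    int r;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (r = 0; r < REPS; r++)
        memcpy(dst, src, BUF_BYTES);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    char *dst, *local, *remote;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this box\n");
        return 1;
    }
    numa_run_on_node(0);                        /* consumer lives on node 0 */

    dst    = numa_alloc_onnode(BUF_BYTES, 0);   /* destination: local       */
    local  = numa_alloc_onnode(BUF_BYTES, 0);   /* source: local            */
    remote = numa_alloc_onnode(BUF_BYTES, 1);   /* source: remote           */
    if (dst == NULL || local == NULL || remote == NULL)
        return 1;

    memset(dst, 0, BUF_BYTES);                  /* fault the pages in       */
    memset(local, 1, BUF_BYTES);
    memset(remote, 1, BUF_BYTES);

    printf("local  source: %.2f s\n", copy_seconds(dst, local));
    printf("remote source: %.2f s\n", copy_seconds(dst, remote));
    return 0;
}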

NUMA/SUMA: The Ever-Hypothetical Topic
Let’s stop talking in the hypothetical. How about something that, gasp, real Oracle Database Administrators have to do more than just occasionally. Consider for a moment transferring a sizable zipped ASCII file in preparation for loading it into an Oracle data warehouse. When booted in the default NUMA mode and running Linux, memory is presented to processes in multiple hierarchies. For example, the following box shows a freshly booted Intel Xeon 5500 (Nehalem EP) box with 16 GB total RAM segmented into two memories. Notice how, just 7 minutes after booting, memory has been consumed in a non-symmetrical fashion. The numactl output shows node 1 with roughly 40% more free memory than node 0. That’s because not every memory usage in the Linux kernel (including drivers) is NUMA aware. But that is not what I’m blogging about.

# uptime;numactl --hardware
 13:28:30 up 7 min,  1 user,  load average: 0.00, 0.09, 0.07
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 5773 MB
node 1 size: 8080 MB
node 1 free: 7955 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
# cat /proc/meminfo
MemTotal:     16427752 kB
MemFree:      14059424 kB
Buffers:         19588 kB
Cached:         239480 kB
SwapCached:          0 kB
Active:          66308 kB
Inactive:       217152 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     16427752 kB
LowFree:      14059424 kB
SwapTotal:     2097016 kB
SwapFree:      2097016 kB
Dirty:            1848 kB
Writeback:           0 kB
AnonPages:       24408 kB
Mapped:          15024 kB
Slab:           170920 kB
PageTables:       3512 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  10310892 kB
Committed_AS:   382752 kB
VmallocTotal: 34359738367 kB
VmallocUsed:    381716 kB
VmallocChunk: 34359356623 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
# free
             total       used       free     shared    buffers     cached
Mem:      16427752    2370064   14057688          0      19872     239852
-/+ buffers/cache:    2110340   14317412
Swap:      2097016          0    2097016

In this section of this blog entry I’d like to show a practical example of honest-to-goodness, real-world work that doesn’t exhibit totally benign NUMA characteristics. Within a VNC session I opened two xterm windows. I’ll call them “left” and “right.” In the left xterm I’ll list a zipped ASCII file to capture the inode so as to prove my testing is happening against the same file. The file is inode 1701506. You’ll also see a stupid little script called henny_penny.sh, named appropriately since I apparently come off as Henny Penny to folks like the reader who emailed me. The henny_penny.sh script executed in the left xterm shows that a shell with a parent process ID of 23283 was able to sling the contents of all_card_trans.ul.gz into /dev/null at the rate of 4.9 GB/s. That is very fast indeed. It is that fast, in fact, because the file had just been moved into the current directory with FTP, so the contents of the approximately 1.5 GB file are cached in memory. Ah, but the question is, what memory?

# ls -li all* henny_penny*
1701506 -rw-r--r-- 1 root root 1472114768 Aug 14 11:31 all_card_trans.ul.gz
1701513 -rwxr-xr-x 1 root root         90 Aug 14 12:17 henny_penny.sh
# cat henny_penny.sh
ps -f
ls -li all_card_trans.ul.gz
date
dd if=all_card_trans.ul.gz of=/dev/null bs=1M
date
# sh ./henny_penny.sh
UID        PID  PPID  C STIME TTY          TIME CMD
root     23283 23280  0 12:13 pts/0    00:00:00 -bash
root     23849 23283  0 12:18 pts/0    00:00:00 sh ./henny_penny.sh
root     23850 23849  0 12:18 pts/0    00:00:00 ps -f
1701506 -rw-r--r-- 1 root root 1472114768 Aug 14 11:31 all_card_trans.ul.gz
Fri Aug 14 12:18:12 PDT 2009
1403+1 records in
1403+1 records out
1472114768 bytes (1.5 GB) copied, 0.30021 seconds, 4.9 GB/s
Fri Aug 14 12:18:12 PDT 2009

In the following box you’ll see how things behaved in the right xterm. I invoked henny_penny.sh (parent PID 23422) and voila, dd(1) was able to shovel the contents of all_card_trans.ul.gz into /dev/null at a rate of 6.0 GB/s. Now, that’s only 22% faster for a totally memory-bound, CPU-saturated task, so why would anyone other than Henny Penny care? Notice how the henny_penny.sh script included the output of the date(1) command. Just three seconds after “left” was muddling through at 4.9 GB/s, “right” proceeded to rip through at 6.0 GB/s. Yes, memory hierarchy matters.

# sh ./henny_penny.sh
UID        PID  PPID  C STIME TTY          TIME CMD
root     23422 23420  0 12:14 pts/3    00:00:00 -bash
root     23856 23422  0 12:18 pts/3    00:00:00 sh ./henny_penny.sh
root     23857 23856  0 12:18 pts/3    00:00:00 ps -f
1701506 -rw-r--r-- 1 root root 1472114768 Aug 14 11:31 all_card_trans.ul.gz
Fri Aug 14 12:18:15 PDT 2009
1403+1 records in
1403+1 records out
1472114768 bytes (1.5 GB) copied, 0.244703 seconds, 6.0 GB/s
Fri Aug 14 12:18:15 PDT 2009

How, What, Why?
The left xterm and its children happen to be executing on cores 0-3 (SMT disabled at the moment but no matter) and the right xterm on cores 4-7. The FTP process executed on one (or more) of cores 4-7 and since Linux prefers to allocate buffers to a process such as this from local memory, you can see why henny_penny.sh in the right xterm achieved the throughput it did.

Who Cares?

Likely nobody until the Xeon 5500 Linux production uptake actually starts! In the meantime there is me (Henny Penny) and a few curiously morbid (er, uh, morbidly curious) Googlers who might stumble upon this trivia.

What’s This Have To Do With Nehalem EX?
Well, even the 4-socket Nehalem EX packaging implements single-hop remote memory. That’s a significant difference from the way 4-socket systems were done with HyperTransport. So I actually don’t expect NUMAisms such as this to be any more painful than with EP (2-socket).

I Still Think He’s Henny Penny
So, let’s take another look at this topic. I’ve already mentioned that Linux likes to allocate memory close to processes when running on Nehalem systems. That’s good, isn’t it? Well, the answer is yes, of course, it depends.

In the following text box you’ll see how I depleted free memory (down to 40MB free) from node 0 by writing zeros to a file. Consider yet another hypothetical with me for one moment. What happens when I execute, say, 100 processes that each allocates a moderate 16 MB of memory with malloc(3)?  Do you think Linux will yank these processes from me, their parent, and place them on node 1 or will they be homed on node 0 with their heaps allocated from node 1? Will it matter? What if they are producers and I am their consumer? Where should they execute? What if they each work on 1/100th of the dumb_test.out file reading into their respective heap? Well, at this point there is no way for 100 processes on node 0 (socket 0) to attack 1/100th segments (buffering in their heap) of that file without 100% remote memory overhead. Could such a “bizarre” hypothetical happen in production? Sure. Is there any way to properly deal with such an issue? Well, yes and no.

If the hypothetical “1/100th program” was coded to libnuma then it can assure process placement and therefore local heap. However, what about the fact that my work file is buffered entirely on node 0 memory? Wouldn’t that guarantee 100% local access to node 0 users of that file but 100% remote for node 1 users? Yes. That’s great for the node 0 users you might say. However, those node 0 users had better not malloc(3) any memory because you know where that memory is going to come from. ‘Round and ’round we go…
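
For what it’s worth, here is roughly what the libnuma end of that might look like for one of the hypothetical “1/100th programs”: pin the worker to a chosen node and take its buffer from that node’s memory so that the heap side of the access, at least, stays local. This is only a sketch under those assumptions (arbitrary node number and buffer size; link with -lnuma), not a recommendation:

/* Hypothetical "1/100th program" placement sketch: home the worker on a
 * chosen node and allocate its working buffer from that node's memory.
 * Build: gcc numa_place.c -lnuma
 */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

int main(int argc, char **argv)
{
    size_t buf_bytes = 16UL * 1024 * 1024;          /* the moderate 16 MB heap */
    int node = (argc > 1) ? atoi(argv[1]) : 0;      /* which node to live on   */
    char *buf;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this box\n");
        return 1;
    }

    numa_run_on_node(node);                         /* pin execution to node   */
    buf = numa_alloc_onnode(buf_bytes, node);       /* local working buffer    */
    if (buf == NULL)
        return 1;

    /* ... read this worker's 1/100th slice of the file into buf and work on
     * it; the heap side is local, but the page cache side is wherever Linux
     * buffered the file ... */

    numa_free(buf, buf_bytes);
    return 0;
}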

# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 5946 MB
node 1 size: 8080 MB
node 1 free: 7987 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
# time dd if=/dev/zero of=dumb_test.out bs=1M count=5946;numactl --hardware
5946+0 records in
5946+0 records out
6234832896 bytes (6.2 GB) copied, 6.07315 seconds, 1.0 GB/s

real    0m6.091s
user    0m0.003s
sys     0m6.069s
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 40 MB
node 1 size: 8080 MB
node 1 free: 7652 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

So, what if I cloak my test with libnuma attributes (inherited by dd from numactl(8))? In the following text box you’ll see that instead of a Cyclops, memory was allocated nice and evenly across both nodes’ page cache when I filled the dumb_test.out file. So in this model, processes homed on either node 0 or node 1 are guaranteed a 50% local access rate when accessing dumb_test.out and I am protected from memory imbalances. In fact, if it were my system and I had to stay with NUMA, I’d consider invoking shells under numactl --interleave. As such, any non-NUMA-aware programs (like FTP) will be granted memory in round-robin fashion, but any NUMA-aware program (coded to libnuma calls) will execute as it would without being wrapped with numactl. It’s just a thought. It isn’t any official recommendation and, as my email in-box suggests, it doesn’t matter anyway…nonetheless, I think the following looks better than a cyclops:

# numactl --interleave=0,1 /bin/bash
# numactl -s
policy: interleave
preferred node: 0 (interleave next)
interleavemask: 0 1
interleavenode: 0
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0 1
nodebind: 0 1
membind: 0 1
# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 5957 MB
node 1 size: 8080 MB
node 1 free: 7988 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
# dd if=/dev/zero of=dumb_test.out bs=1M count=5957
5957+0 records in
5957+0 records out
6246367232 bytes (6.2 GB) copied, 6.24962 seconds, 999 MB/s
# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 2825 MB
node 1 size: 8080 MB
node 1 free: 4854 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

Quantifying Hugepages Memory Savings with Oracle Database 11g

In my recent post about the physical memory consumed by page tables when hugepages are not in use, I showed an example of 500 dedicated connections to an Oracle Database 11g instance with an 8000M SGA consuming roughly 7 gigabytes of physical memory just for page tables. A reader emailed me to point out that it would be informative to show page table consumption with hugepages employed. How right.

The following script was running while I invoked sqlplus 500 times to connect to the same instance discussed in that post. As the output shows, page table cost peaked at roughly 265 MB, compared to the 7 GB lost to page tables in the non-hugepages case.

I’m scratching my head and thinking of what else could possibly be said on the matter…

$ while true
> do
> ps -ef | grep oracletest | wc -l
> grep PageTables /proc/meminfo
> sleep 30
> done
1
PageTables:      23484 kB
501
PageTables:     264432 kB
501
PageTables:     264900 kB
501
PageTables:     265400 kB
501
PageTables:     265944 kB
130
PageTables:      83112 kB
120
PageTables:      78672 kB
110
PageTables:      74188 kB
100
PageTables:      69712 kB
90
PageTables:      65264 kB
80
PageTables:      60804 kB
70
PageTables:      56332 kB
60
PageTables:      51872 kB
50
PageTables:      47376 kB
40
PageTables:      42848 kB
30
PageTables:      37904 kB
20
PageTables:      32976 kB
10
PageTables:      28024 kB
1
PageTables:      23532 kB
1
PageTables:      23528 kB

Little Things Doth Crabby Make – Part X. Posts About Linux Hugepages Makes Some Crabby It Seems. Also, Words About Sizing Hugepages.

I received a few pieces of (not)fan-mail about my latest post in the Crabby Series. One reader took offense at the fact that I bother to blog about hugepages because, in his words:

…you insult the intelligence of your readers. You know full well everyone uses hugepages

Is that why Metalink Note 749851.1 goes to the trouble of advising DBAs that the default database setup from Database Configuration Assistant (DBCA) configures Automatic Memory Management which does not use hugepages?

I assure you, not everyone uses hugepages, and part of the reason is that it can be difficult to set up if you have several databases, especially if your databases have a mix of heavy PGA usage and heavy SGA usage. Also, if your calculations are off and there are insufficient hugepages to cover the SGA, Oracle will go ahead and allocate with a shmget() call that doesn’t pass in SHM_HUGETLB. The effect of that little twist is that you’ll be “missing” the memory that was carved out for hugepages while the SGA resides in ordinary, non-hugepages memory. So, for instance, if you calculate your SGA to be 1 GB and you allocate 513 hugepages (1 GB plus one page for wiggle room) but your SGA turns out to be 1073758208 bytes (1 GB + 16 KB), you’ll get a non-hugepages SGA and eventually roughly 2 GB will be tied up. I think it is an important topic.
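
To make the shmget() point concrete, the fall-back pattern described above looks roughly like the following. This is purely an illustrative sketch, not Oracle’s actual allocation code, and the segment size is an arbitrary 1 GB (a multiple of the 2 MB hugepage size):

/* Sketch of the fall-back behavior described above: ask for a hugepage-backed
 * SysV segment first; if the hugepage pool cannot cover it, allocate the same
 * segment in ordinary 4 KB pages, leaving the hugepage pool reserved but idle.
 */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000                      /* value from linux/shm.h */
#endif

int main(void)
{
    size_t seg_bytes = 1024UL * 1024 * 1024;   /* 1 GB, a multiple of 2 MB */
    int shmid;

    shmid = shmget(IPC_PRIVATE, seg_bytes, IPC_CREAT | SHM_HUGETLB | 0600);
    if (shmid < 0) {
        /* Not enough free hugepages: fall back to small pages. The memory set
         * aside in the hugepage pool is now "missing" as far as this segment
         * is concerned. */
        shmid = shmget(IPC_PRIVATE, seg_bytes, IPC_CREAT | 0600);
    }
    if (shmid < 0) {
        perror("shmget");
        return 1;
    }

    printf("shmid %d\n", shmid);
    shmctl(shmid, IPC_RMID, NULL);             /* clean up the demo segment */
    return 0;
}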

Metalink 401749.1
Oracle Support offers a script to assist DBAs in calculating hugepages requirement. With all your instances up, run the script and it will calculate a setting for you. The note is entitled Shell Script to Calculate Values Recommended HugePages / HugeTLB Configuration.

There is a small nit regarding this note (the procedure it involves, actually). In order for the script to give you a recommendation, you have to revert from AMM first, then boot your instances with a non-AMM memory model so the script can see what SysV IPC segments are being allocated for the instances. So, it’s a multi-step process. I suppose with a lot of extra thought the same thing could be calculated by tallying up all the “granule files” found in /dev/shm under AMM, but no matter. This is fairly simple.

Let’s look at my system. Here’s what we’ll see:

First, we’ll see how large the SGA really is.

Next, we’ll see how large of an IPC segment the instance called for. In my case it is about 37MB larger than the actual SGA. That’s fine.

Finally we’ll see the output of the hugepages_settings.sh script to see what it advises.

SQL*Plus: Release 11.X.0.X.0 Production on Mon Jul 27 13:35:41 2009

Copyright (c) 1982, 2009, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.X.0.X.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> show sga

Total System Global Area 8351150080 bytes
Fixed Size                  2214808 bytes
Variable Size            1543505000 bytes
Database Buffers         6777995264 bytes
Redo Buffers               27435008 bytes
SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.X.0.X.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
$ ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x522d5fd4 327681     oracle    660        8390705152 53                      

$ sh ./hugepages_setting.sh
Recommended setting: vm.nr_hugepages = 4003
$ grep Huge /proc/meminfo
HugePages_Total:  5000
HugePages_Free:    999
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

So it looks like the script is accurate and even allows a little wiggle room. That’s good. I think this script (being helpful) combined with a healthy fear of the nastiness in large SGA+large dedicated connection deployments (without hugepages) should get us all one step closer to insisting on hugepages backing for our SGAs.

Little Things Doth Crabby Make – Part IX. Sometimes You Have To Really, Really Want Your Hugepages Support For Oracle Database 11g.

Recently I had someone ask me in email why I bother posting installments on my Little Things Doth Crabby Make series. I responded by saying I think it is valuable to IT professionals to know they are not alone when confronted by something that makes little sense, or makes them crabby if that be the case. It’s all about the Wayward Googler(tm).

Well, Wayward Googler, it’s coming on thick.

Using Memory and Then Allocating HugePages (Or Die Trying)
I purposefully booted my system with no hugepages allocated in /etc/sysctl.conf (vm.nr_hugepages = 0). I then booted an Oracle Database 11g instance with sga_target set to 8000M. Next, I fired off 500 dedicated connections using the following goofy stuff:


$ cat doit
cnt=0
until [ $cnt -eq 500 ]
do
   sqlplus rw/rw @foo.sql &
   (( cnt = $cnt + 1 ))
done

wait

$ cat foo.sql
HOST sleep 120
exit;

The 500 connections were established in a matter of moments since I’m using a Xeon 5500 (Nehalem) based dual-socket server running Linux with a 2.6 kernel. Yes, these processors are really, really fast. But that, of course, isn’t what made me crabby.

Directly before I invoked the script that fired off my 500 dedicated connections, I executed a script that intermittently peeked at how much memory was being wasted on page tables. Remember, without hugepages (hugetlb) backing the IPC shared memory for the SGA, there is page table overhead for every connection to the instance. The size of the SGA and the number of dedicated connections compound to consume potentially significant amounts of memory. Although that is also not what made me crabby, let’s look at what 500 dedicated sessions attaching to an 8000 MB SGA look like as the user count ramps up:


$ while true
> do
> grep PageTables /proc/meminfo
> sleep 10
> done

PageTables:       3764 kB
PageTables:       4696 kB
PageTables:      65848 kB
PageTables:     176956 kB
PageTables:     287616 kB
PageTables:     366540 kB
PageTables:     478224 kB
PageTables:     588424 kB
PageTables:     699832 kB
PageTables:     792356 kB
PageTables:     802468 kB
PageTables:     834004 kB
PageTables:     851980 kB
PageTables:     835432 kB
PageTables:     834948 kB
PageTables:     835052 kB
PageTables:    1463260 kB
PageTables:    2072864 kB
PageTables:    2679572 kB
PageTables:    3283456 kB
PageTables:    3892628 kB
PageTables:    4496868 kB
PageTables:    5100908 kB
PageTables:    6846256 kB
PageTables:    6866820 kB
PageTables:    6829388 kB
PageTables:    6874752 kB
PageTables:    6879360 kB
PageTables:    6883076 kB
PageTables:    6895244 kB
PageTables:    6901528 kB
PageTables:    6917256 kB
PageTables:    6927984 kB
PageTables:    6999196 kB
PageTables:    6999472 kB
PageTables:    7000048 kB
PageTables:    7088160 kB
PageTables:    7087960 kB
PageTables:    7088812 kB
PageTables:    7132804 kB
PageTables:    7121120 kB

Got Spare Memory? Good, Don’t Use Hugepages
Uh, just short of 7 GB of physical memory lost to page tables! That’s ugly, but that’s not what made me crabby. Before I forget, did I mention that it is a really good idea to back your SGA with hugepages if you are running a lot of dedicated connections and have a large SGA?
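
As a sanity check, a quick back-of-envelope calculation lands in the same neighborhood. Assuming ordinary 4 KB pages and 8-byte page table entries on x86_64, and ignoring upper-level tables and anything else each process maps, the arithmetic looks like this:

/* Back-of-envelope: page table cost of N dedicated connections each mapping
 * an 8000 MB SGA with 4 KB pages (8 bytes per PTE on x86_64). Rough estimate
 * only.
 */
#include <stdio.h>

int main(void)
{
    unsigned long sga_bytes   = 8000UL * 1024 * 1024;
    unsigned long conns       = 500;
    unsigned long ptes        = sga_bytes / 4096;     /* ~2 million entries */
    unsigned long per_proc_kb = ptes * 8 / 1024;      /* ~16 MB per process */

    printf("per process: %lu kB, total for %lu connections: %lu kB\n",
           per_proc_kb, conns, per_proc_kb * conns);
    return 0;
}

That works out to about 16,000 kB per process and roughly 8,000,000 kB across 500 connections, which is right in the neighborhood of the peak shown in the output above.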

So, What Did Make Him Crabby Anyway?
Wasting all that physical memory with page tables was just part of some analysis I’m doing. I never aim to waste memory (nor processor cycles for TLB misses) like that. So, I shut my Oracle Database 11g instance down in order to implement hugepages and move on. This is where I started getting crabby.

The first thing I did was verify there were, in fact, no allocated hugepages. Next, I checked to see if I had enough free memory to mess with. In this case I had most of the 16GB of physical memory free. So, I tried to allocate 6200 2MB hugepages by echoing that value into /proc. Finally, I checked to make sure I was granted the hugepages I requested…Irk. Now that made me crabby. Instead of 6200 I was given what appears to be some random number someone pulled out of the clothes hamper: 604 hugepages:

# grep HugePages /proc/meminfo
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
# free
             total       used       free     shared    buffers     cached
Mem:      16427876     422408   16005468          0      24104     209060
-/+ buffers/cache:     189244   16238632
Swap:      2097016      29836    2067180
# echo 6200 > /proc/sys/vm/nr_hugepages
# grep HugePages /proc/meminfo
HugePages_Total:   604
HugePages_Free:    604
HugePages_Rsvd:      0

So, I then checked to see what free memory looked like:

# free
             total       used       free     shared    buffers     cached
Mem:      16427876    1670400   14757476          0      27040     207924
-/+ buffers/cache:    1435436   14992440
Swap:      2097016      29696    2067320

Clearly I was granted that oddball 604 hugepages I didn’t ask for. Maybe I’m supposed to just take what I’m given and be happy?

Please Sir, May I Have Some More?

I thought, perhaps the system just didn’t hear me clearly. So, without changing anything I just belligerently repeated my command and found that doing so increased my allocated hugepages by a whopping 2:

# echo 6200 > /proc/sys/vm/nr_hugepages
# grep HugePages /proc/meminfo
HugePages_Total:   608
HugePages_Free:    608
HugePages_Rsvd:      0

I began to wonder if there was some reason 6200 was throwing the system a curve-ball. Here’s what happened when I lowered my expectations by requesting 3100:

# echo 3100 > /proc/sys/vm/nr_hugepages;grep HugePages /proc/meminfo
HugePages_Total:   610
HugePages_Free:    610
HugePages_Rsvd:      0

Great. I began to wonder how long I could continually whack my head against the wall picking up little bits and pieces of hugepages along the way. So, I scripted 1000 consecutive requests for hugepages. I thought, perhaps, it was necessary to really, really want those hugepages:

# cnt=0;until [ $cnt -eq 1000 ]
> do
> echo 6200 > /proc/sys/vm/nr_hugepages
> (( cnt = $cnt + 1 ))
> done
# grep HugePages /proc/meminfo
HugePages_Total:  5502
HugePages_Free:   5502
HugePages_Rsvd:      0

Brilliant! Somewhere along the way the system decided to start doling out more than those piddly 2-page allocations in response to my request for 6200, otherwise I would have exited this loop with 2,610 hugepages. Instead, I exited the loop with 5502.

Well, since some is good, more must be better. I decided to run that stupid loop again just to see if I could pick up any more crumbs:

# cnt=0;until [ $cnt -eq 1000 ]; do echo 6200 > /proc/sys/vm/nr_hugepages; (( cnt = $cnt + 1 )); done
# grep PageTables /proc/meminfo
PageTables:       7472 kB
# grep '^Hu' /proc/meminfo
HugePages_Total:  5742
HugePages_Free:   5742
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

That makes me crabby.

Summary:
We should all do ourselves a favor and make sure we boot our servers with sufficient hugepages to cover our SGA(s). And, of course, you don’t get hugepages if you use Automatic Memory Management.

Little Things Doth Crabby Make – Part VIII. Hugepage Support for Oracle Database 11g Sometimes Means Using The ipcrm Command. Ugh.

Not that anyone should care about the things that make me crabby, but…here comes another brief post in my Little Things Doth Crabby Make series.

In the following box, you’ll see how I was just simply trying to remove a wee bit of detritus, specifically a segment of SysV IPC shared memory. So, here’s how this all transpired:

  • I used the ipcs command to get the shmid (262145).
  • I then fat-fingered a typo and tried to remove shmid 262146
  • Having realized what I did I immediately satisfied my curious morbidity, er, I mean morbid curiosity, and checked the return code from the ipcrm command. Oddly, ipcrm reported that it was perfectly happy to not remove a non-existent segment. But that’s not entirely what made me crabby.
  • I then issued the command without the typo.
  • Next (as fast as I could type) I checked to see what segments remained. That’s where I started to get even crabbier.
  • Since it seemed I was facing some odd stubbornness, I decided to issue the “old school” style command (i.e., the shm argument/option pair). The command failed but since that style of command is documented as deprecated I simply thought I was getting deprecated functionality.
  • Finally, I ran ipcs once again to find that the segment was gone.
# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      600        189        1          dest
0x522d5fd4 262145     oracle    660        6444548096 14                      

# ipcrm -m 262146
# echo $?
0
# ipcrm -m 262145
# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      600        189        1          dest
0x00000000 262145     oracle    660        6444548096 14         dest 

# ipcrm shm 262145
cannot remove id 262145 (Invalid argument)
# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      600        189        1          dest         

Forget the oddball situation with the return code for a moment. What I just discovered is that the memory clean-up behind the ipcrm command is asynchronous. You all may have known that for all time, but I didn’t. Or, at least I don’t remember forgetting that fact if I did know it at one time.

It turns out the very first ipcrm -m 262145 was in the process of succeeding. Removal of a SysV segment is deferred: the request marks the segment for destruction (hence the zeroed key and the dest status in the ipcs output above) and the segment is actually torn down once the attached processes detach. That’s why my deprecated command usage was answered truthfully with EINVAL. The segment was gone, I was just being impatient.

Disclaimer
I reserve the right to remain crabby about the success code returned after my failed attempt to remove a segment that didn’t exist.

Hold it. Why would anyone care about SysV IPC shared memory where the Linux ports of Oracle Database 11g are concerned? After all, Automatic Memory Management is implemented via memory mapped files.

Summary
Don’t script against the return code of the Linux ipcrm command. It might make you crabby.

Oracle Exadata Storage Server Technical Deep Dive Series – Part II: Requires a Citrix CODEC.

Several people have pointed out that Part II in my Oracle Exadata Storage Server Technical Deep Dive Series would not play back on their computers for lack of a codec.  That is true and I didn’t know that when I tested the uploaded version because I have that particular codec installed on my system. The required bits are produced by Citrix Online. I have updated my webcast index page with more information for obtaining the required codec.

This issue relates only to Part II in the series.

For what it’s worth, the codec is easily uninstalled through control panel->Add Remove Software.

Oracle Exadata Storage Server Architecture: Impossible To Back Up?

Too Large To Back Up?
If it is possible to back up a large data warehouse at rates of over 11 TB/h for full backup and more than 100 TB/h for incremental backups, maybe not!

The Maximum Availability Architecture (MAA) team has just published a paper covering tape backup of the HP Oracle Database Machine. The conclusion reads:

With the Exadata Storage Server, Oracle provides an architecture that allows customers with large databases to scale their tape backup to any desired performance level. The number and connectivity of media servers, and the number and speed of tape drives will define the performance limit of backup, not the Database Machine. With two media servers, effective full backup rates from 11.2 TB/hour and effective incremental backup rate of over 104 TB/hour were achieved.

The title of the paper is Tape Backup Performance and Best Practices for Exadata Storage and the HP Oracle Database Machine and the paper can be accessed at the following link:

Tape Backup Performance and Best Practices for Exadata Storage and the HP Oracle Database Machine

Oracle Database File System (DBFS) on Exadata Storage Server. Hidden Content?

A colleague of mine in Oracle’s Real-World Performance Group just pointed out to me that the link (on my Papers, Webcasts, etc page) to the archived webcast of Part IV in my Oracle Exadata Storage Server Technical Deep Dive Series was stale. Actually, the problem turned out to be that I mistakenly set the file to expire after a fixed number of downloads. I didn’t think it would get downloaded 500 times, but it seems I was wrong.

I just fixed it, so if you’ve tried to get Part IV and hit this problem also, please give it a go now. For those of you who don’t know about this series, please visit my Papers, Webcasts, etc page where I have posted a description of each archived webcast.

Where’s Part III?

I am still on the hook for doing Part III again to recover from the loss of the IOUG recording. Part III goes into the aspect of Exadata architecture known as the division of work. Understanding the division of work is important for folks trying to decide what sort of configuration they’d have to assemble to match the performance of an Exadata deployment (e.g., HP Oracle Database Machine). I’ve been very interrupt-driven lately so I have not been able to do the Part III over again. I’ll post a blog entry when it is available.

Aren’t Customers Choosing Oracle Database Machine?

This is just a quick blog entry to point to the first heavily customer-focused news release about the Oracle Database Machine (based on Oracle Exadata Storage Server).

Here is the link:

Customers are Choosing the Oracle Database Machine



