Archive for the 'oracle' Category



With Friends Like That Who Needs Enemies?

James Morle, my longtime friend, OakTable Network Co-founder, and Director of Scale Abilities, has noticed I’m blogging about NUMA topics, so he thought it fitting to share a video of another apparent ex-Sequent employee who, like me, seems to have failed a NUMA 12-step program.

Thanks, James. I needed that.

Oracle on Opteron with Linux-The NUMA Angle (Part V). Introducing numactl(8) and SUMA. Is The Oracle x86_64 Linux Port NUMA Aware?

This blog entry is part five in a series. Please visit here for links to the previous installments.

Opteron-Based Servers are NUMA Systems
Or are they? It depends on how you boot them. For instance, I have 2 HP DL585 servers clustered with the PolyServe Database Utility for Oracle RAC. I booted one of the servers as a non-NUMA system by tweaking the BIOS so that memory was interleaved on a 4KB basis. This is a memory model HP calls Sufficiently Uniform Memory Access (SUMA), as stated in this DL585 Technology Brief (pg. 6):

Node interleaving (SUMA) breaks memory into 4-KB addressable entities. Addressing starts with address 0 on node 0 and sequentially assigns through address 4095 to node 0, addresses 4096 through 8191 to node 1, addresses 8192 through 12287 to node 2, and addresses 12288 …

Booting in this fashion essentially turns an HP DL585 into a “flat-memory” SMP—or a SUMA in HP parlance. There seem to be conflicting monikers for using Opteron SMPs in this mode. IBM has a Redbook that covers the varying NUMA offerings in their System x portfolio. The abstract for this Redbook states:

The AMD Opteron implementation is called Sufficiently Uniform Memory Organization (SUMO) and is also a NUMA architecture. In the case of the Opteron, each processor has its own “local” memory with low latency. Every CPU can also access the memory of any other CPU in the system but at longer latency.

Whether it is SUMA or SUMO, the concept is cool, but a bit foreign to me given my NUMA background. The NUMA systems I worked on in the 90s consisted of distinct, separate small systems—each with their own memory and I/O cards, power supplies and so on. They were coupled into a single shared memory image with specialized hardware inserted into the system bus of each little system. These cards were linked together and the whole package was a cache coherent SMP (ccNUMA).

Is SUMA Recommended For Oracle?
Since the HP DL585 can be SUMA/SUMO, I thought I’d give it a test. But first I did a little research to see how most folks use these in the field. I know from the BIOS on my system that you actually get a warning and have to override it when setting up interleaved memory (SUMA). I also noticed that in one of HP’s Oracle Validated Configurations, the following statement is made:

Settings in the server BIOS adjusted to allow memory/node interleaving to work better with the ‘numa=off’ boot option

and:

Boot options
elevator=deadline numa=off

 

I found this to be strange, but I don’t yet fully understand why that recommendation is made. Why did they perform this validation with SUMA? When running a 4-socket Opteron system in SUMA mode, only 25% of all memory accesses will be to local memory. When I say all, I mean all—both user and kernel mode. The Linux 2.6 kernel is NUMA-aware, so it seems like a waste to transform a NUMA system into a SUMA system. How can boiling down a NUMA system with interleaving (SUMA) possibly be optimal for Oracle? I will blog about this more as this series continues.
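For reference, those boot options simply end up on the kernel line of the boot loader configuration. A hypothetical /boot/grub/grub.conf stanza (the kernel version and root device names here are illustrative only) would look something like this:

title Red Hat Enterprise Linux AS (2.6.9-34.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-34.ELsmp ro root=/dev/VolGroup00/LogVol00 elevator=deadline numa=off
        initrd /initrd-2.6.9-34.ELsmp.img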

Is the x86_64 Linux Oracle Port NUMA Aware?
No, sorry, it is not. I might as well just come out and say it.

The NUMA API for Linux is very rudimentary compared to the boutique features in legacy NUMA systems like Sequent DYNIX/ptx and SGI IRIX, but it does support memory and process placement. I’ll blog later about the things it is missing that a NUMA-aware Oracle port would require.

The Linux 2.6 kernel is NUMA aware, but what is there for applications? The NUMA API, which is implemented in the library called libnuma.so. But you don’t have to code to the API to effect NUMA awareness. The major 2.6 Linux kernel distributions (RHEL4 and SLES) ship with a command that uses the NUMA API in ways I’ll show later in this blog entry. The command is numactl(8) and it dynamically links to the NUMA API library (emphasis added by me):

$ uname -a
Linux tmr6s13 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ type numactl
numactl is hashed (/usr/bin/numactl)
$ ldd /usr/bin/numactl
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003ba3200000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003ba2f00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003ba2d00000)

Whereas the numactl(8) command links with libnuma.so, Oracle does not:

$ type oracle
oracle is /u01/app/oracle/product/10.2.0/db_1/bin/oracle
$ ldd /u01/app/oracle/product/10.2.0/db_1/bin/oracle
libskgxp10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libskgxp10.so (0x0000002a95557000)
libhasgen10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libhasgen10.so (0x0000002a9565a000)
libskgxn2.so => /u01/app/oracle/product/10.2.0/db_1/lib/libskgxn2.so (0x0000002a9584d000)
libocr10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libocr10.so (0x0000002a9594f000)
libocrb10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libocrb10.so (0x0000002a95ab4000)
libocrutl10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libocrutl10.so (0x0000002a95bf0000)
libjox10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libjox10.so (0x0000002a95d65000)
libclsra10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libclsra10.so (0x0000002a96830000)
libdbcfg10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libdbcfg10.so (0x0000002a96938000)
libnnz10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libnnz10.so (0x0000002a96a55000)
libaio.so.1 => /usr/lib64/libaio.so.1 (0x0000002a96f15000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003ba3200000)
libm.so.6 => /lib64/tls/libm.so.6 (0x0000003ba3400000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003ba3800000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003ba7300000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003ba2f00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003ba2d00000)

No Big Deal, Right?
This NUMA stuff must just be a farce then, right? Let’s dig in. First, I’ll use the SLB (http://oaktable.net/getFile/148). Later I’ll move on to what fellow OakTable Network member Anjo Kolk and I refer to as the Jonathan Lewis Oracle Computing Index. The JL Oracle Computing Index is yet another microbenchmark that makes it very easy to run and compare memory throughput from one server to another using an Oracle workload. I’ll use it next to blog about NUMA effects on a running instance of Oracle. After that I’ll move on to more robust Oracle OLTP and DSS workloads. But first, more SLB.

The SLB on SUMA/SUMO
First, let’s use the numactl(8) command to see what this DL585 looks like. Is it NUMA or SUMA?

$ uname -a
Linux tmr6s13 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ numactl --hardware
available: 1 nodes (0-0)
node 0 size: 32767 MB
node 0 free: 30640 MB

OK, this is a single node NUMA—or SUMA since it was booted with memory interleaving on. If it wasn’t for that boot option the command would report memory for all 4 “nodes” (nodes are sockets in the Opteron NUMA world). So, I set up a series of SLB tests as follows:

$ cat example1
echo "One thread"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./trigger
wait

echo "Two threads, same socket"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 6
./memhammer 262144 6000 &
./trigger
wait

echo "Two threads, different sockets"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./trigger
wait

echo "4 threads, 4 sockets"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./cpu_bind $$ 3
./memhammer 262144 6000 &
./cpu_bind $$ 1
./memhammer 262144 6000 &
./trigger
wait

echo "8 threads, 4 sockets"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./cpu_bind $$ 3
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./cpu_bind $$ 1
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./trigger
wait

And now the measurements:

$ sh ./example1
One thread
Total ops 1572864000 Avg nsec/op 71.5 gettimeofday usec 112433955 TPUT ops/sec 13989225.9
Two threads, same socket
Total ops 1572864000 Avg nsec/op 73.4 gettimeofday usec 115428009 TPUT ops/sec 13626363.4
Total ops 1572864000 Avg nsec/op 74.2 gettimeofday usec 116740373 TPUT ops/sec 13473179.5
Two threads, different sockets
Total ops 1572864000 Avg nsec/op 73.0 gettimeofday usec 114759102 TPUT ops/sec 13705788.7
Total ops 1572864000 Avg nsec/op 73.0 gettimeofday usec 114853095 TPUT ops/sec 13694572.2
4 threads, 4 sockets
Total ops 1572864000 Avg nsec/op 78.1 gettimeofday usec 122879394 TPUT ops/sec 12800063.1
Total ops 1572864000 Avg nsec/op 78.1 gettimeofday usec 122820373 TPUT ops/sec 12806214.2
Total ops 1572864000 Avg nsec/op 78.2 gettimeofday usec 123016921 TPUT ops/sec 12785753.3
Total ops 1572864000 Avg nsec/op 78.5 gettimeofday usec 123527864 TPUT ops/sec 12732868.1
8 threads, 4 sockets
Total ops 1572864000 Avg nsec/op 156.3 gettimeofday usec 245773200 TPUT ops/sec 6399656.3
Total ops 1572864000 Avg nsec/op 156.3 gettimeofday usec 245848989 TPUT ops/sec 6397683.4
Total ops 1572864000 Avg nsec/op 156.4 gettimeofday usec 245941009 TPUT ops/sec 6395289.7
Total ops 1572864000 Avg nsec/op 156.4 gettimeofday usec 246000176 TPUT ops/sec 6393751.5
Total ops 1572864000 Avg nsec/op 156.6 gettimeofday usec 246262366 TPUT ops/sec 6386944.2
Total ops 1572864000 Avg nsec/op 156.5 gettimeofday usec 246221624 TPUT ops/sec 6388001.1
Total ops 1572864000 Avg nsec/op 156.7 gettimeofday usec 246402465 TPUT ops/sec 6383312.8
Total ops 1572864000 Avg nsec/op 156.8 gettimeofday usec 246594031 TPUT ops/sec 6378353.9

The SUMA configuration baselines at a 71.5ns average write operation and tops out at about 156ns with 8 concurrent threads of SLB execution (one per core). Let’s see what the SLB does on NUMA.

SLB on NUMA
First, let’s get an idea what the memory layout is like:

$ uname -a
Linux tmr6s14 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ numactl --hardware
available: 4 nodes (0-3)
node 0 size: 8191 MB
node 0 free: 5526 MB
node 1 size: 8191 MB
node 1 free: 6973 MB
node 2 size: 8191 MB
node 2 free: 7841 MB
node 3 size: 8191 MB
node 3 free: 7707 MB

OK, this means that there is approximately 5.5GB, 6.9GB, 7.8GB and 7.7GB of free memory on “nodes” 0, 1, 2 and 3 respectively. Why is the first node (node 0) lop-sided? I’ll tell you in the next blog entry. Let’s run some SLB. First, I’ll use numactl(8) to invoke memhammer with the directive that forces allocation of memory on a node-local basis. The first test is one memhammer process per socket:

$ cat ./membind_example.4
./create_sem
numactl --membind 3 --cpubind 3 ./memhammer 262144 6000 &
numactl --membind 2 --cpubind 2 ./memhammer 262144 6000 &
numactl --membind 1 --cpubind 1 ./memhammer 262144 6000 &
numactl --membind 0 --cpubind 0 ./memhammer 262144 6000 &
./trigger
wait

$ bash ./membind_example.4
Total ops 1572864000 Avg nsec/op 67.5 gettimeofday usec 106113673 TPUT ops/sec 14822444.2
Total ops 1572864000 Avg nsec/op 67.6 gettimeofday usec 106332351 TPUT ops/sec 14791961.1
Total ops 1572864000 Avg nsec/op 68.4 gettimeofday usec 107661537 TPUT ops/sec 14609340.0
Total ops 1572864000 Avg nsec/op 69.7 gettimeofday usec 109591100 TPUT ops/sec 14352114.4

This test is the same as the one above called “4 threads, 4 sockets” performed on the SUMA configuration, where the latencies were 78ns. Switching from SUMA to NUMA and executing with NUMA placement brought the latencies down 13% to an average of 68ns. Interesting. Moreover, this test with 4 concurrent memhammer processes actually demonstrates better latencies than the single-process average on SUMA, which was roughly 72ns. That comparison alone is interesting because it makes the point clear that SUMA in a 4-socket system is a 75% remote memory configuration—even for a single process like memhammer.

The next test was 2 memhammer processes per socket:

$ more membind_example.8
./create_sem
numactl --membind 3 --cpubind 3 ./memhammer 262144 6000 &
numactl --membind 3 --cpubind 3 ./memhammer 262144 6000 &
numactl --membind 2 --cpubind 2 ./memhammer 262144 6000 &
numactl --membind 2 --cpubind 2 ./memhammer 262144 6000 &
numactl --membind 1 --cpubind 1 ./memhammer 262144 6000 &
numactl --membind 1 --cpubind 1 ./memhammer 262144 6000 &
numactl --membind 0 --cpubind 0 ./memhammer 262144 6000 &
numactl --membind 0 --cpubind 0 ./memhammer 262144 6000 &
./trigger
wait

$ sh ./membind_example.8
Total ops 1572864000 Avg nsec/op 95.8 gettimeofday usec 150674658 TPUT ops/sec 10438809.2
Total ops 1572864000 Avg nsec/op 96.5 gettimeofday usec 151843720 TPUT ops/sec 10358439.6
Total ops 1572864000 Avg nsec/op 96.9 gettimeofday usec 152368004 TPUT ops/sec 10322797.2
Total ops 1572864000 Avg nsec/op 96.9 gettimeofday usec 152433799 TPUT ops/sec 10318341.5
Total ops 1572864000 Avg nsec/op 96.9 gettimeofday usec 152436721 TPUT ops/sec 10318143.7
Total ops 1572864000 Avg nsec/op 97.0 gettimeofday usec 152635902 TPUT ops/sec 10304679.2
Total ops 1572864000 Avg nsec/op 97.2 gettimeofday usec 152819686 TPUT ops/sec 10292286.6
Total ops 1572864000 Avg nsec/op 97.6 gettimeofday usec 153494359 TPUT ops/sec 10247047.6

What’s that? Writing memory on the SUMA configuration in the 8 concurrent memhammer case demonstrated latencies on the order of 156ns, but those dropped 38% to 97ns by switching to NUMA and using the Linux 2.6 NUMA API. No, of course an Oracle workload is not all random writes, but a system has to be able to handle the difficult aspects of a workload in order to offer good throughput. I won’t ask the rhetorical question of why Oracle is not NUMA aware in the x86_64 Linux ports until my next blog entry, where the measurements will not be based on the SLB, but on a real Oracle instance instead.

Déjà vu
Hold it. Didn’t the Dell PS1900 with quad-core Clovertown Xeon E5320 processors exhibit ~500ns latencies with only 4 concurrent threads of SLB execution (1 per core)? That was what was shown in this blog entry. Interesting.

I hope it is becoming clear why NUMA awareness is interesting. NUMA systems offer a great deal of potential incremental bandwidth when local memory is preferred over remote memory.

Next up—comparisons of SUMA versus NUMA with the Jonathan Lewis Computing Index and why all is not lost just because the 10gR2 x86_64 Linux port is not NUMA aware.
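As a teaser for that “all is not lost” discussion: numactl(8) policies are inherited by child processes, so even a binary that does not link libnuma.so can be started under an explicit NUMA policy. A minimal sketch (an illustration only, not a recommendation; whether the SGA pages actually land where you expect depends on which process faults them first, which is part of what I intend to measure) would be:

$ numactl --interleave=all sqlplus '/ as sysdba' <<EOF
startup
EOF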

Ouch! Migrating an Oracle7 Database from OpenVMS to Oracle10g on Linux.

I’ve been following the thread over at Yet Another Oracle DBA Blog about the effort to migrate a very complex Oracle7 database on OpenVMS (old Digital gear)  to Oracle10g on Linux. I hope Herod will give some details about the hardware used for the Linux system. It looks like they are about 6 days into the data migration aspect of the project. Here is a snippet of the system being migrated from:

Oracle database version 7, with an astounding 3,319 tables in a single schema with only one primary key constraint created on one table. No database referential integrity at all. 891 procedures, 4319 triggers, 771 functions (no packages).

That sounds like a mess! This would be quite a project for anyone to take on. Give the site a visit and see what you think:

http://yaodba.blogspot.com/2007/01/new-project-time-for-new-job.html

http://yaodba.blogspot.com/2007/01/conversion-of-07.html

http://yaodba.blogspot.com/2007/01/conversion-of-07-still-processing.html

Migrate from Windows to Linux. The Stupid Quote of the Day.

While I prefer Linux over Windows for Oracle (purely personal preference), I think this Linux Journal webpage has the Stupid Quote of the Day Award:

The smartest move for anyone to make is to migrate from Windows to Linux.

Techno-Religious fanaticism at its best! Way to go!

What Does This Have to do With Oracle?
As I pointed out in my blog entry about Oracle revenue from Windows deployments, Larry still makes more money from Windows deployments than from Linux. Yes, these are CY2005 numbers; we’ll have to see what 2006 looks like. I suspect more of the same, honestly. That is, if those numbers are ever revealed.

Windows or Linux for Oracle is a choice that can only be made by each IT shop. If you are a Windows shop, you’ll choose Windows. If you are a traditional Unix shop, and want to play in the commodity space, you’ll go with Linux.

Yes Direct I/O Means Concurrent Writes. Oracle Doesn’t Need Write-Ordering.

If Sir Isaac Newton was walking about today dropping apples to prove his theory of gravity, he’d feel about like I do making this blog entry. The topic? Concurrent writes on file system files with Direct I/O.

A couple of months back, I made a blog entry about BIGFILE tablespaces in ASM versus modern file systems. The controversy at hand at the time was about the dreadful OS locking overhead that must surely be associated with using large files in a file system. I spent a good deal of time tending to that blog entry pointing out that the world is no longer flat and that such age-old concerns over OS locking overhead on modern file systems are no longer relevant. Modern file systems support Direct I/O, and one of the subtleties that seems to have been lost in the definition of Direct I/O is the elimination of the write-ordering locks that are required for regular file system access. The serialization is normally required so that if two processes should write to the same offset in the same file, one entire write must occur before the other—thus preventing fractured writes. With databases like Oracle, no two processes will write to the same offset in the same file at the same time. So why have the OS impose such locking? It doesn’t with modern file systems that support Direct I/O.

In regards to the blog entry called ASM is “not really an optional extra” With BIGFILE Tablespaces, a reader posted the following comment:

“node locks are only an issue when file metadata changes”
This is the first time I’ve heard this. I’ve had a quick scout around various sources, and I can’t find support for this statement.
All the notes on the subject that I can find show that inode/POSIX locks are also used for controlling the order of writes and the consistency of reads. Which makes sense to me….

Refer to:
http://www.ixora.com.au/notes/inode_locks.htm

Sec 5.4.4 of
http://www.phptr.com/articles/article.asp?p=606585&seqNum=4&rl=1

Sec 2.4.5 of
http://www.solarisinternals.com/si/reading/oracle_fsperf.pdf

Table 15.2 of
http://www.informit.com/articles/article.asp?p=605371&seqNum=6&rl=1

Am I misunderstanding something?

And my reply:

…in short, yes. When I contrast ASM to a file system, I only include direct I/O file systems. The number of file systems and file system options that have eliminated the write-ordering locks is a very long list starting, in my experience, with direct async I/O on Sequent UFS as far back as 1991 and continuing with VxFS with Quick I/O, VxFS with ODM, PolyServe PSFS (with the DBOptimized mount option), Solaris UFS post Sol8-U3 with the forcedirectio mount option and others I’m sure. Databases do their own serialization so the file system doing so is not needed.

The ixora and solarisinternals references are very old (2001/2002). As I said, Solaris 8U3 direct I/O completely eliminates write-ordering locks. Further, Steve Adams also points out that Solaris 8U3 and Quick I/O were the only ones they were aware of, but that doesn’t mean VxFS ODM (2001), Sequent UFS (starting in 1992) and ptx/EFS, and PolyServe PSFS (2002) weren’t all supporting completely unencumbered concurrent writes.

Ari, thanks for reading and thanks for bringing these old links to my attention. Steve is a fellow Oaktable Network Member…I’ll have to let him know about this out of date stuff.

There is way too much old (and incomplete) information out there.

A Quick Test Case to Prove the Point
The following screen shot shows a shell process on one of my Proliant DL585s with Linux RHEL 4 and the PolyServe Database Utility for Oracle. The session is using the PolyServe PSFS filesystem mounted with the DBOptimized mount option which supports Direct I/O. The test consists of a single dd(1) process overwriting the first 8GB of a file that is a little over 16GB. The first invocation of dd(1) writes 2097152 4KB blocks in 283 seconds for an I/O rate of 7,410 writes per second. The next test consisted of executing 2 concurrent dd(1) processes each writing a 4GB portion of the file. Bear in mind that the age old, decrepit write-ordering locks of yester-year serialized writes. Without bypassing those write locks, two concurrent write-intensive processes cannot scale their writes on a single file. The screen shot shows that the concurrent write test achieved 12,633 writes per second. Although 12,633 represents only 85% scale-up, remember, these are physical I/Os—I have a lot of lab gear, but I’d have to look around for a LUN that can do more than 12,633 IOps and I wanted to belt out this post. The point is that on a “normal” file system, the second go around of foo.sh with two dd(1) processes would take the same amount of time to complete as the single dd(1) run. Why? Because both tests have the same amount of write payload and if the second foo.sh suffered serialization the completion times would be the same:

conc_write2.JPG
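For readers who cannot make out the screen shot, here is a hedged sketch of what the foo.sh test amounts to (the file name is made up and the file is presumed to already exist on the DBOptimized-mounted PSFS filesystem; the block counts reflect the 4KB writes described above):

#!/bin/bash
# Case 1: one dd(1) process overwriting the first 8GB (2097152 x 4KB blocks) of the file
time dd if=/dev/zero of=dd_test_file bs=4096 count=2097152 conv=notrunc

# Case 2: two concurrent dd(1) processes, each overwriting a different 4GB range of the same file
time ( dd if=/dev/zero of=dd_test_file bs=4096 count=1048576 conv=notrunc &
       dd if=/dev/zero of=dd_test_file bs=4096 count=1048576 seek=1048576 conv=notrunc &
       wait )

If the second case were serialized by write-ordering locks, it would take roughly as long as the first; on a Direct I/O filesystem that drops those locks, it should not.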

Oracle on Opteron with Linux–The NUMA Angle (Part I)

There are Horrible Definitions of NUMA Out There on the Web
I want to start blogging about NUMA with regard to Oracle because NUMA has reached the commodity hardware scene with Opteron and HyperTransport technology. Yes, I know Opteron has been available for a long time, but it wasn’t until the Linux 2.6 kernel that there were legitimate claims of the OS being NUMA-aware. Before I can start blogging about NUMA/Oracle on Opteron related topics, I need to lay down some groundwork.

First, I’ll just come out and say it, I know NUMA—really, really well. I spent the latter half of the 1990’s inside the Sequent Port of Oracle working out NUMA-optimizations to exploit Sequent NUMA-Q 2000—the first commercially available NUMA system. Yes, Data General, SGI and Digital were soon to follow with AViiON, Origin 2000 and the AlphaServer GS320 respectively. The first port of Oracle to have code within the kernel specifically exploiting NUMA architecture was the Sequent port of Oracle8i.

 

Glossary
I’d like to offer a couple of quick definitions. The only NUMA that matters where Oracle is concerned is Cache Coherent NUMA (a.k.a CC-NUMA):

NUMA – A microprocessor-based computer system architecture comprised of compute nodes that possess processors and memory and usually disk/network I/O cards. A CC-NUMA system has specialized hardware that presents all the varying memory components as a single memory image to the processors. This has historically been accomplished with crossbar, switch or SCI ring technologies. In the case of Opteron, NUMA is built into the processor since each processor has an on-die memory controller. Understanding how a memory reference is satisfied in a NUMA system is the most important aspect of understanding NUMA. Each memory address referenced by the processors in a NUMA system is essentially “snooped” by the “NUMA memory controller” which in turn determines if the memory is local to the processor or remote. If remote, the NUMA “engine” must perform a fetch of the memory and install it into the requesting processor cache (which cache depends on the implementation although most have historically implemented an L3 cache for this remote-memory “staging”). The NUMA “engine” has to be keenly tuned to the processor’s capabilities since all memory related operations have to be supported including cache line invalidations and so forth. Implementations have varied wildly since the early 1990s.

There have been NUMA systems that were comprised of complete systems linked by a NUMA engine. One such example was the Sequent NUMA-Q 2000 which was built on commodity Intel-based Pentium systems “chained” together by a very specialized piece of hardware that attached directly to each system bus. That specialized hardware was called the Lynx Card which had an OBIC (Orion Bus Interface Controller) and a SCLIC (SCI Line Interface Controller) as well as 128MB L3 remote cache. On the Lynx card was a 510-pin GaAs ASIC that served as the “data pump” of the NUMA “engine”. These commodity NUMA “building blocks” were called “Quads” because they had 4 processors, local memory, local network and disk I/O adaptors—a lot of them. Digital referred to their physical building blocks as QBBs (Quad Building Blocks) and logically (in their API for instance) as “RADs” for Resource Affinity Domains.

In the case of Opteron, each processor is considered a “node” with only CPU and memory locality. With Opteron, network and disk I/O are uniform.

NUMA Aware – This term applies to software. NUMA-aware software is optimized for NUMA such that the topology is understood and runtime decisions can be made, such as what segment of memory to allocate from or what adaptor to perform I/O through. The latter, of course, does not apply to Opteron. NUMA awareness starts in the kernel, and with a NUMA API, applications too can be made NUMA aware. The Linux 2.6 kernel has had NUMA awareness built in—to a certain extent—and there has been a NUMA API available for just as long. Is the kernel fully NUMA-optimized? Not by any stretch of the imagination. Is the API complete? No. Does that mean the Linux NUMA-related technology is worthless? That is what I intend to blog about.

Some of the good engineers that built NUMA awareness into the Sequent NUMA-Q operating system—DYNIX/ptx—have contributed NUMA awareness to Linux through their work in the IBM Linux Technology Center. That is a good thing.

This thread on Opteron and Linux NUMA is going to be very Oracle-centric and will come out as a series of installments. But first, a trip down memory lane.

The NUMA Stink
In the year 2000, Sun was finishing a very anti-NUMA campaign. I remember vividly the job interview I had with Sun’s Performance, Availability and Architecture Engineering (PAE) group led by Ganesh Ramamurthy. Those were really good guys; I enjoyed the interview, and I think I even regretted turning down their offer so I could instead work in the Veritas Database Editions Group on the Oracle Disk Manager Library. One of the prevailing themes during that interview was how hush, hush, wink, wink they were about using the term NUMA to describe forthcoming systems such as StarCat. That attitude even showed in the following Business Review Online article where the VP of Enterprise Systems at Sun in that time frame stated:

“We don’t think of the StarCat as a NUMA or COMA server,” he said. “This server has SMP latencies, and it is just a bigger, badder Starfire.”

No, it most certainly isn’t a COMA (although it did implement a few of the aspects of COMA) and it most certainly has always been a NUMA. Oops, I forgot to define COMA…next entry…and, oh, Opteron has made saying NUMA cool again!

 

A Day for Typos. Let’s move the “c” and “n” Keys, OK?

Two typos in one session. If ci(1) and nash(8) are important, I think we should move “c” far away from “v” and “n” far away from “b” on the QWERTY keyboard. When I think vi and bash, I’m not thinking ci(1) and nash(8)…Votes?

$ uname -a
Linux tmr6s13 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ cd /tmp
$ mkdir foo
$ cd foo
$ cat > foo.sh
echo this is a dumb script
$ chmod +x foo.sh
$ bash foo.sh
this is a dumb script
$ nash foo.sh
(running in test mode).
Red Hat nash version 4.2.1.6 starting
(echo) this is a dumb script
$ ci foo.sh
foo.sh,v  <--  foo.sh
enter description, terminated with single ‘.’ or end of file:
NOTE: This is NOT the log message!
>>
RCS: Quit
RCS: Cleaning up.
$ type nash
nash is hashed (/sbin/nash)
$ bash --help
GNU bash, version 3.00.15(1)-release-(x86_64-redhat-linux-gnu)
Usage: bash [GNU long option] [option] …
bash [GNU long option] [option] script-file …
GNU long options:
--debug
--debugger
--dump-po-strings
--dump-strings
--help
--init-file
--login
--noediting
--noprofile
--norc
--posix
--protected
--rcfile
--rpm-requires
--restricted
--verbose
--version
--wordexp
Shell options:
-irsD or -c command or -O shopt_option (invocation only)
-abefhkmnptuvxBCHP or -o option
Type `bash -c "help set"' for more information about shell options.
Type `bash -c help' for more information about shell builtin commands.
Use the `bashbug’ command to report bugs.

The Decommissioning of the Oracle Storage Certification Program

I’ve known about this since December 2006, but since the cat is out of the proverbial bag, I can finally blog about it.

Oracle has taken another step to break down Oracle-over-NFS adoption barriers. In the early days of Oracle supporting deployments of Oracle over NFS, the Oracle Storage Compatibility Program (OSCP) played a crucial role in ensuring a particular NAS device was suited to the needs of an Oracle database. Back then the model was immature, but a lot has changed since then. In short, if you are using Oracle over NFS, storage-related failure analysis is as straightforward as it is with a SAN. That is, it takes Oracle about the same amount of time to determine the fault is in the storage—downwind of their software—with either architecture. To that end, Oracle has announced the decommissioning of the Oracle Storage Compatibility Program. The URL for the OSCP (click here, or here for a copy of the web page in the Wayback Machine) states the following (typos preserved):

At this time Oracle believes that these three specialized storage technologies are well understood by the customers, are very mature, and the Oracle technology requirements are well know. As of January, 2007, Oracle will no longer validate these products. We thank our partners for their contributions to the OSCP.

Lack of Choice Does Not Enable Success
It will be good for Oracle shops to have even more options to choose from when selecting a NAS provider as an Oracle over NFS platform. I look forward to other players emerging on the scene. This is not just Network Appliance’s party by any means. Although I don’t have first-hand experience, I’ve been told that the BlueArc Titan product is a very formidable platform for Oracle over NFS—but it should come as no surprise that I am opposed to vendor lock-in.

Oracle Over NFS—The Demise of the Fibre Channel SAN
That seems to be the conclusion people draw when Oracle over NFS comes up. That is not the case, so your massive investment in SAN infrastructure was not a poor choice. It was the best thing going at the time. If you have a formidable SAN, you would naturally use a SAN-gateway to preserve your SAN investment while reducing the direct SAN connectivity headaches. In this model deploying another commodity server is as simple as plugging in Cat 5 cabling, and mounting an exported NFS filesystem from the SAN gateway. No raw partitions to fiddle with on the commodity server, no LUNs to carve out on the SAN and most importantly, no FCP connectivity overhead. All the while, the data is stored in the SAN so your existing backup strategy applies. This model works for Linux, Solaris, HPUX, AIX.
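To give a feel for just how simple that provisioning step is, here is a hypothetical Linux example (the server name, export and mount point are made up, and the mount options are only illustrative of the sort commonly documented for Oracle over NFS; check the current platform-specific recommendations before deploying):

# /etc/fstab entry on the commodity server (illustrative)
nasgw1:/vol/oradata  /u02/oradata  nfs  rw,bg,hard,nointr,tcp,vers=3,rsize=32768,wsize=32768,timeo=600,actimeo=0  0 0

$ mount /u02/oradata

After that, the database server sees a filesystem it can use right away: no raw partitions, no LUN carving, no HBAs to fiddle with.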

Oracle over NFS—Who Needs It Anyway?
The commodity computing paradigm is drastically different from the central server approach we grew to know in the 1990s. You know, one or two huge servers connected to DAS or a SAN. It is very simple to run that little orange cabling from a single cabinet to a couple of switches. These days people throw around terms like grid without ever actually drawing a storage connectivity schematic. Oracle’s concept of a grid is, of course, a huge Real Application Clusters database spread out over a large number of commodity servers. Have you ever tried to build one with a Fibre Channel SAN? I’m not talking about those cases where you meet someone at an Oracle User Group that refers to his 3 clustered Linux servers running RAC as a grid. Oh how I hate that! I’m talking about connecting, say, 50, 100 or 250 servers all running Oracle—some RAC, but mostly not—to a SAN. I’m talking about commodity computing in the Enterprise—but the model I’m discussing is so compelling it should warrant consideration from even the guy with the 3-node “grid”. I’m talking, again, about Oracle over NFS—the simplest connectivity and storage provisioning model available for Oracle in the commodity computing paradigm.

Storage Connectivity and Provisioning
Providing redundant storage paths for large numbers of commodity servers with Fibre Channel is too complex and too expensive. Many IT shops are spending more than the cost of each server to provide redundant SAN storage paths since each server needs 2 Host Bus Adaptors (or a dual port HBA) and 2 ports in large Director-class switches (at approximately USD $4,000 per). These same servers are also fitted with Gigabit Ethernet. How many connectivity models do you want to deal with? Settle on NFS for Oracle and stick with bonded Gigabit Ethernet as the connectivity model—very simple! With the decommissioning of the OSCP, Oracle is making the clear statement that Oracle over NFS is no longer an edge-case deployment model. I’d recommend giving it some thought.

EMC’s MPFSi for Oracle: Enjoy It While It Lasts, or Not.

Regular readers of my blog know that I am a proponent of Oracle over NFS—albeit in the commodity computing space. I’ll leave the Superdomes and IBM System p servers to their direct SAN plumbing. So I must be a huge fan of EMC’s Celerra MPFSi—the Multi-Path Filesystem, right? No, I’m not. This blog post is about why not MPFSi.

In this paper about EMC MPFSi, pictures speak a thousand words. But first, some of my own—with an Oracle-centric take. MPFSi would be just fine I suppose except it is both an NFS server-side architecture and a proprietary NFS client package. The following screen shot shows a basic diagram of Celerra with MPFSi. First, there are three components at the back end. One is the Celerra and another is an MDS 9509 Connectrix. The Celerra is there to service NAS filesystem metadata operations and the Connectrix with some iSCSI glue is there to transfer data requests in block form. That is, if you create a file and immediately write a block to it, you will have the file creation satisfied by the Celerra and the block write by the Connectrix. The final component is the SAN—since Celerra is a SAN-gateway. There is nothing wrong with SAN gateways by any means. I think SAN gateways are the best way to leverage a SAN for provisioning storage to the legacy monolithic Unix systems as well as the large number of commodity servers sitting on the same datacenter floor. That is, SAN to the legacy Unix system and SAN-gateway-NFS to the commodity servers. That’s tiered storage. Ultimately you have a single SAN holding all the data, but the provisioning and connectivity model of the gateway side is much better suited to large numbers of commodity servers than FCP. Here is the simplified topology of MPFSi:

NOTE, some browsers require you to right click->view.

smallcelerra-1.jpg

 

 

MPFSi requires NFS client-side software. The software presents a filesystem that is compatible with NFS protocols. There is an agent that intercepts NFS protocol messages and forwards them to the Celerra, which then handles them as the MPFSi architecture dictates, as the following screen shot shows.

smallcelerra-2.jpg

What’s This Have to do With Oracle?
So what’s the big deal? Well, I suppose if you absolutely need to stay with EMC as your SAN gateway vendor, then this is the choice for you. There are SAN-agnostic choices for SAN gateways as I’ve pointed out on this blog too many times. What about Oracle? Since Oracle10g supports NFS in the traditional model, I’m sure MPFSi works just fine. What about 11g? We’ve all heard “rumors” that 11g has a significant NFS-improvement focus. It is good enough with 10g, but 11g aims to make it an even better I/O model. That is good for Oracle’s On Demand hosting business since they use NFS exclusively. Will the 11g NFS enhancements function with MPFSi? Only an 11g beta program participant could tell you at the moment. I also know that the beta program legalese essentially states that participants can neither confirm nor deny whether they are, or are not, Oracle11g beta program participants. I’ll leave it at that.

Oracle over NFS is Not a Metadata Problem
When Oracle accesses files over NFS, there is no metadata overhead to speak of. Oracle is a simple lseek, read/write engine as far as NFS is concerned, and there is no NFS client cache to get in the way either. Oracle opens files on NFS filesystems with the O_DIRECT flag. This alleviates a good deal of the overhead that typical NFS usage exhibits. Oracle has an SGA; it doesn’t need NFS client-side cache. So MPFSi is not going to help where scalable NFS for Oracle is concerned. MPFSi better addresses the traditional problems with scaling home shares and so on.

Using Absolutely Dreadful Whitepapers as Collateral
Watch out if you read this ESG paper on EMC MPFSi because a belt sander might just drop from the ceiling and grind you to a fine powder as punishment for exposing yourself to such spam. This paper is a real jewel. If you dare risk the belt sander, I’ll leave it to you to read the whole thing. I’d like to point out, however, that it shamelessly uses relative performance numbers without the trouble of filling in any baselines for us in the performance section. For instance, the following shot shows a “graph” in the paper where the author makes the claim that MPFSi performs 300% better than normal NFS. This is typical chicanery—without the actual throughput achieved at the baseline, we can’t really ascertain what was tested. I have a $5 bet that the baseline was not, say, triple-bonded GbE delivering some 270+ MB/sec.

smallcelerra-3.PNG

 

No Blog Entries Over The Weekend!

I’ve been told that a blog without photos is too boring. Well, it just so happens that the reason I didn’t make any blog entries over the past weekend was because I was down at the family farm making a mess and taking photos. The job at hand was to relocate the pump that supplies the house and barn with water from one spring to another. It was messy, but first, a photo from the driveway…nice country.

dig2

When we first arrived we got to see the condition the contractors left their equipment in. We know where the hard ground is and told them how to approach the spring, but they had their ideas and wound up stuck up to the chassis in cold mud with a track missing:

dig6.jpg

That was pretty late in the day. The next morning I had to jump in there for a photo—while I was still nice and tidy.

dig5.jpg

What ensued after that photo was about 6 hours of toiling with the contractors and our family machinery to get that thing out of the mudhole. Next, they relocated the machine into position for digging the new pump location while an old friend of mine and I did some fence repair.

You can’t really see it well, but the machine is tethered to that young Douglas Fir tree behind it—or else the machine was going right into the hole being dug. Or maybe my daughter is just holding it in place…

The next shot shows the hole complete at about 8 feet deep with a 12’ x 4’ culvert positioned on-end to prevent caving. At that point the spring was producing about 200 to 300 GPM into the hole—a very dependable water source. Within hours the water was running crystal clear. The next task is to place some 15 cubic yards of 5″-open rock and a fabric barrier, then the pump goes in and the whole thing is capped off.

dig4.jpg

The next shot puts it into perspective with a view from the house down into the hole where the spring is. Steep country. Nice farm. Good time had by all.

dig3

There, I did it! Another blog entry with photos!

The 10.2.0.3 Patchset with VxFS Saga: An Example of Incorrectly Describing the Incorrectness

In the blog entry entitled “Oracle 10.2.0.3 Patchset is Not Functional with Solaris SPARC 64-bit and Veritas Filesystem”, I pointed out that the 10.2.0.3 patchset was not functional if your database resides in VxFS (bug 5747918). There is updated information now, but first a bit of humor.

In the solution section of the note covering this bug, Metalink note 405825.1 states:

Workaround
————–
Move the entire database to a non-Veritas filesystem

Resolution
———–
Download and apply Metalink patch:5752399.
The instructions to apply patch:5752399 are included in the patch README file.

Move the entire database? Uh, I’d go for the patch for the patchset. Or as I’ve already pointed out, Oracle Disk Manager is not affected by this bug at all.

The Patch for the Patchset
Oracle Patch number 5752399 is considered a mandatory patch for the 10.2.0.3 patchset.

Incorrectly Describing the Incorrectness
Regarding the nature of the bug, Metalink note 405825.1 incorrectly states:

The 10.2.0.3 patchset code changes attempted to use directio with vxfs (Veritas) filesystems, which vxfs does not support.

On the contrary, VxFS does support direct I/O via:

  • Quick I/O
  • ODM
  • VxFS mount options (e.g., convosync)

This documentation on Sun’s website gets it right:

If you are using databases with VxFS and if you have installed a license key for the VERITAS Quick I/O™ for Databases feature, the mount command enables Quick I/O by default. The noqio option disables Quick I/O. If you do not have Quick I/O, mount ignores the qio option. Alternatively, you can increase database performance using the mount option convosync=direct, which utilizes direct I/O.

Correctly Describing the Incorrectness
Since the Metalink note got it wrong by stating that VxFS doesn’t support “directio” (a.k.a. Direct I/O), I’ll clear it up here. As I stated in this blog entry, the true nature of the bug is that the 10.2.0.3 porting team implemented a call to the Solaris directio(3C) library routine, which is a way to push Direct I/O onto a UFS file but is not supported by VxFS. There, now, doesn’t that make more sense? Am I being a stickler? Yes, because there is a huge difference between the two following phrases:

attempted to use directio with vxfs

attempted to use directio(3C) with vxfs

Workaround
Did they really suggest moving an entire database as a workaround for a misplaced call to directio(3C)?

Busy Idle Processes. Huh? The AIX KPROC process called “wait”.

A recent thread on the oracle-l email list was about the AIX 5L KPROC process called “wait”. The email that started the thread reads:

We are reviewing processes on our P690 machine and get the following.

I’ve googled a little bit but can’t find anything of interest. Are these processes that I should be concerned with – should we kill them? A normal ps -ef | grep 45078 does not return the process, so I really can’t figure out what these are.

$ ps auxw | head -10
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 45078 9.3 0.0 48 36 - A Oct 13 120026:37 wait
root 40980 9.0 0.0 48 36 - A Oct 13 116428:47 wait
root 36882 8.9 0.0 48 36 - A Oct 13 114010:26 wait
root 32784 8.8 0.0 48 40 - A Oct 13 113205:56 wait
[…output truncated…]

Another participant in the thread followed up with:

you will find the answer in:

http://www-304.ibm.com/jct09002c/isv/tech/faq/individual.jsp?oid=1:89156

And yet another good member of the list added:

Also, the reason you don’t see it with “ps -ef” is that ps doesn’t show kernel processes by default – you have to specify the “-k” flag, e.g.:

/opt/oracle ->ps -efk|grep wait

root 8196 0 0 Nov 11 - 720:31 wait
root 53274 0 0 Nov 11 - 3628:35 wait
root 57372 0 0 Nov 11 - 554:40 wait
root 61470 0 0 Nov 11 - 1883:24 wait
[…output truncated…]

So What Do I Have To Add?
So why am I blogging about this if the mystery has been explained? Well, I think having a kernel process attributed with time when the processor is in the idle loop is just strange. Microprocessors only have two states: running and idle. On a Unix system, the running state is attributed to either user or kernel mode. Attributing the idle state to anything is like charging nothing to something.

Yes, I suppose I’m nit-picking. There is something about the running state that I find so many people do not know, and it has to do with processor efficiency. Regardless of which mode—user or kernel—the processor monitoring tools can only report that the processor was idle or not. That’s all. Processor monitoring tools (e.g., vmstat, sar, etc.) cannot report processor efficiency. Remember that a processor is not always getting work done efficiently. Not that there is anything you can do about it, but a processor running in either mode accessing heavily contended memory is getting very little work done per cycle. The term CPI (cycles per instruction) is used to represent this efficiency. Think of it this way: if a CPU accesses a memory location in cache, the instruction completes in a couple of CPU cycles. If the processor is accessing a word in a memory line that is being completely hammered by other processors (shared memory), that single instruction will stall the processor until it completes. As such, the workload is said to execute with a high CPI.

There you have it, some trivial pursuit.

What Does This Have To Do With Oracle?
Well, I’ll give you an example. A process spinning on a latch is executing the test loop in cache. The loop executes at a very, very low CPI. So if you have a lot of processes routinely spinning on latches, you have a low CPI—but that doesn’t mean you are getting any throughput. Latch contention is just tax if you will. When the latch is released, the processors that are spinning get a cacheline invalidation. They immediately read the line again. The loading of that line brings the CPI way up for a moment as the line is installed into cache, and on and on it goes. The “ownership” of the memory line with the latch structure just ping-pongs around the box. Envision a bunch of one-armed people standing around passing around a hot potato. Yep, that about covers it. No, not actually. Somewhere there has to be a copy of the potato and a race to get back to the original. Hmmm, I’ll have to work on that analogy—or take an interest in hierarchical locking. <smiley>

Therein lies the reason that just a few contended memory lines with really popular Oracle latches (e.g., redo allocation, hot chains latches, etc) can account for reasonable percentages of the work that gets done on an Oracle system. On the other hand, systems with really balanced processor/memory capabilities (e.g., System p, Opteron on Hypertransport, etc), and systems with very few processors don’t have much trouble with this stuff. And, of course, Oracle is always working to eliminate singleton latches as well.

 

Analysis and Workaround for the Solaris 10.2.0.3 Patchset Problem on VxFS Files

In the blog entry about the Solaris 10.2.0.3 patchset not functioning on VxFS, I reported that Metalink says the patchset does not work on VxFS. That is true. Since the Metalink notes have not been updated, I’ll blog a bit about what I’ve found out. Note, the Metalink note says not to use the patchset because of this bug. I am not here to fight Oracle support.

It turns out that what is happening is the Solaris porting group is now using an ioctl() that is not supported on VxFS files—but not calling the ioctl(2) directly. The bug results in an error stack a bit like this:

ORA-01501: CREATE DATABASE failed
ORA-00200: control file could not be created
ORA-00202: control file: ‘/some/path/control01.ctl’
ORA-27037: unable to obtain file status
SVR4 Error: 25: Inappropriate ioctl for device

The text in bug number 5747918 is nice enough to include the output of truss when the problem happens. The ioctl() is _ION. This is the ioctl(2) that is implemented within the directio(3C) library routine. No, don’t believe this developers.sun.com webpage when it refers to directio(3C) as a system call. It isn’t. However, they do provide an example of using the directio(3C) call in this small directio(3C) test program.

The Solaris directio(3C) call is used to push direct I/O onto a file. In the demonstration of the bug (5747918), the 10.2.0.3 patchset is trying to push direct I/O onto the file descriptor held on the control file stored in VxFS. That isn’t how you get direct I/O on VxFS. I wonder if this call to directio(3C) only happens if you have filesystemio_options=DirectIO|setall. That would make sense.

Workaround
If you use ODM on VxFS, this call to directio(3C) does not occur, so you won’t see the problem. Thanks to a reader comment on my blog and my age-old friend still at Veritas (I mean Symantec) for verification that ODM works around the problem.

A Test Program

If you create a file in a VxFS mount called “foo”, like this:

$ dd if=/dev/zero of=foo bs=4096 count=16

Then compile and run the following small program, and you will see the same problem Oracle 10.2.0.3 is exhibiting. The same program on UFS should work fine.

$ cat t.c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/fcntl.h>  /* directio(3C) and DIRECTIO_ON on Solaris */

int
main(void)
{
	int ret, handle;

	/* Open the test file and try to push direct I/O onto it the UFS way */
	handle = open("./foo", O_RDONLY);
	if ((ret = directio(handle, DIRECTIO_ON)) < 0)
	{
		printf("Failure : return code is: %d\n", ret);
	}
	else
	{
		printf("The ioctl embedded in the directio(3C) call functions on the file.\n");
	}
	return 0;
} /* End */
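To try it, something along these lines should do (the compiler invocation and mount points are illustrative):

$ cc -o /tmp/dio_test t.c
$ cd /vxfs_mount && dd if=/dev/zero of=foo bs=4096 count=16 && /tmp/dio_test
$ cd /ufs_mount && dd if=/dev/zero of=foo bs=4096 count=16 && /tmp/dio_test

On the VxFS mount you should see the failure message; on UFS the directio(3C) call should succeed.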

Another Potential Workaround
If you want to test the rest of what 10.2.0.3 has to offer without ODM—even with a VxFS database—I think you should be able to explicitly set filesystemio_options=none and get around the problem. Be aware I have not tested that however. The worst thing that could happen is that setting filesystemio_options in this manner is indeed a workaround that would allow you to test the many other reasons you actually need 10.2.0.3!

If you find otherwise, please comment on the blog.

Oracle 10.2.0.3 Patchset is Not Functional with Solaris SPARC 64-bit and Veritas Filesystem

A Faulty Patchset
One of the members of the oracle-l email list just posted that 10.2.0.3 is a non-functional patchset for Solaris SPARC 64-bit if you are using Veritas VxFS for datafiles. I just checked Metalink and found that this is Oracle Bug 5747918, covered in Metalink Note 405825.1.

According to the email on oracle-l:

Applies to:

Oracle Server – Enterprise Edition – Version: 10.2.0.3 to 10.2.0.3
Oracle Server – Standard Edition – Version: 10.2.0.3 to 10.2.0.3
Solaris Operating System (SPARC 64-bit)
Oracle Database Server 10.2.0.3
Sun Solaris Operating System
Veritas Filesystem

Solution

Until a patch for this bug is available, or 10.2.0.3 for Solaris is re-released, you should restore from backups made before the attempted upgrade. Do not attempt to upgrade again until a fix for this issue is available. Bug:5747918 is published on Metalink if you wish to follow the progress.

I’d stay with 10.2.0.2 I think.

Comparing 10.2.0.1 and 10.2.0.3 Linux RAC Fencing. Also, Fencing Failures (Split Brain).

BLOG UPDATE 2011.08.11 : For years my criticism of Oracle Clusterware fencing methodology brought ire from many who were convinced I was merely a renegade. The ranks of “the many” in this case were generally well-intended but overly convinced that Oracle was the only proven clustering technology in existence.  It took many years for Oracle to do so, but they did finally offer support for IPMI fencing integration in the 11.2 release of Oracle Database. It also took me a long time to get around to updating this post.  Whether by graces of capitulation or a reinvention of the wheel, you too can now, finally, enjoy a proper fencing infrastructure. For more information please see: http://download.oracle.com/docs/cd/E11882_01/rac.112/e16794/admin.htm#CHDGIAAD

I’ve covered the clusters concept of fencing quite a bit on this blog (e.g., RAC Expert or Clusters Expert and Now is the Time to Open Source, etc), and in papers such as this paper about clusterware, and in an appendix in the Julian Dyke/Steve Shaw book about RAC on Linux. If I’ve said it once, I’ve said it 1000 times; if you are not a clusters expert you cannot be a RAC expert. Oddly though, Oracle seems to be sending a message that clusterware is commoditized—and it really isn’t. On the other hand, Oracle was brilliant for heading down the road of providing their own clusterware. Until all the kinks are worked out, it is good to know as much as you can about what is under the covers.

Linux RAC “Fencing”
As I’ve pointed out in the above referenced pieces, Oracle “fencing” is not implemented by healthy servers taking action against rogue servers (e.g., STONITH), but instead the server that needs to be “fenced” is sent a message. With that message, the sick server will then reboot itself. Of course, a sick server might not be able to reboot itself. I call this form of fencing ATONTRI (Ask The Other Node To Reboot Itself). This blog entry is not intended to bash Oracle clusterware “fencing”—it is what it is, it works well, and for those who choose, there is the option of running integrated legacy clusterware or validated third party clusterware to fill in the gaps. Instead, I want to blog about a couple of interesting observations and then cover some changes that were implemented to the Oracle init.cssd script under 10.2.0.3 that you need to be aware of.

Logging When Oracle “Fences” a Server
As I mentioned in this blog entry about the 10.2.0.3 CRS patchset, I found 10.2.0.1 CRS—or is that “clusterware”—to be sufficiently stable to just skip over 10.2.0.2. So what I’m about to point out might be old news to you folks. The logging text produced by Oracle clusterware changed between 10.2.0.1 and 10.2.0.3. But, since CRS has a fundamental flaw in the way it logs this text, you’d likely never know it.

Lots of Looking Going On
As an aside, one of the cool things about blogging is that I get to track the search terms folks use to get here. Since the launch of my blog, I’ve had over 11,000 visits from readers looking for information about the most common error message returned if you have a botched CRS install on Linux—that text being:

PROT-1: Failed to initialize ocrconfig

No News Must Be Good News
I haven’t yet blogged about the /var/log/messages entry you are supposed to see when Oracle fences a server, but if I had, I don’t think it would be a very common google search string anyway. No, the reason isn’t that Oracle so seldom needs to fence a server. The reason is that the text generally (nearly never, actually) doesn’t make it into the system log. Let’s dig into this topic.

The portion of the init.cssd script that acts as the “fencing” agent in 10.2.0.1 is coded to produce the following entry in the /var/log/messages file via the Linux logger(1) command (line numbers precede code):

194 LOGGER="/usr/bin/logger"
[snip]
1039 *)
1040 $LOGERR "Oracle CSSD failure. Rebooting for cluster integrity."
1041
1042 # We want to reboot here as fast as possible. It is imperative
1043 # that we do not flush any IO to the shared disks. Choosing not
1044 # to flush local disks or kill off processes gracefully shuts
1045 # us down quickly.
[snip]
1081 $EVAL $REBOOT_CMD

Let’s think about this for a moment. If Oracle needs to “fence” a server, the server that is being fenced should produce the following text in /var/log/messages:

Oracle CSSD failure. Rebooting for cluster integrity.

Where’s Waldo?
Why is it that when I google for “Oracle CSSD failure. Rebooting for cluster integrity” I get 3, count them, 3 articles returned? Maybe the logger(1) command simply doesn’t work? Let’s give that a quick test:

[root@tmr6s14 log]# logger "I seem to be able to get messages to the log"
[root@tmr6s14 log]# tail -1 /var/log/messages
Jan 9 15:16:33 tmr6s14 root: I seem to be able to get messages to the log
[root@tmr6s14 log]# uname -a
Linux tmr6s14 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux

Interesting. Why don’t we see the string Oracle CSSD failure when Oracle fences then? It’s because the logger(1) command merely sends a message to syslogd(8) via a socket—and then it is off to the races. Again, back to the 10.2.0.1 init.cssd script:

22 # FAST_REBOOT - take out the machine now. We are concerned about
23 # data integrity since the other node has evicted us.
[…] lines deleted
177 case $PLATFORM in
178 Linux) LD_LIBRARY_PATH=$ORA_CRS_HOME/lib
179 export LD_LIBRARY_PATH
180 FAST_REBOOT="/sbin/reboot -n -f"

So at line 1040, the script sends a message to syslogd(8) and then immediately forces a reboot at line 1081—with the -n option to the reboot(8) command forcing a shutdown without sync(1). So there you have it, the text is drifting between the bash(1) context executing the init.cssd script and the syslogd(8) process that would do a buffered write anyway. I think the planets must really be in line for this text to ever get to the /var/log/messages file—and I think the google search for that particular string goes a long way towards backing up that notion. When I really want to see this string pop up in /var/log/messages, I fiddle with putting sync(1) commands and sleep before line 1081. That is when I am, for instance, pulling physical connections from the Fibre Channel SAN paths and studying how Oracle behaves by default.
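For the curious, the sort of lab-only tweak I’m describing amounts to something like the following (purely illustrative; it deliberately defeats the script’s stated goal of rebooting as quickly as possible, so it is for lab observation only):

# lab-only instrumentation inserted just before the reboot at line 1081
sync ; sync
sleep 5
$EVAL $REBOOT_CMD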

By the way, the comments at lines 22-23 are the definition of ATONTRI.

Paranoia?
I’ve never understood that paranoia at lines 1042-1043 which state:

We want to reboot here as fast as possible. It is imperative that we do not flush any IO to the shared disks.

It may sound a bit nit-picky, but folks this is RAC and there are no buffered writes to shared disk! No matter really, even if there was a sync(1) command at line 1080 in the 10.2.0.1 init.cssd script, the likelihood of getting text to /var/log/messages is still going to be a race as I’ve pointed out.

Differences in 10.2.0.3
Google searches for fencing articles anchored with the Oracle CSSD failure string are about to get even more scarce. In 10.2.0.3, the text that the script attempts to send to the /var/log/messages file changed—the string no longer contains CSSD, but CRS instead. The following is a snippet from the init.cssd script shipped with 10.2.0.3:

452 *)
453 $LOGERR "Oracle CRS failure. Rebooting for cluster integrity."

A Workaround for a Red Hat 3 Problem in 10.2.0.3 CRS
OK, this is interesting. In the 10.2.0.3 init.cssd script, there is a workaround for some RHEL 3 race condition. I would be more specific about this, but I really don’t care about any problems init.cssd has in its attempt to perform fencing since, for me, the whole issue is moot. PolyServe is running underneath it and PolyServe is not going to fail a fencing operation. Nonetheless, if you are not on RHEL 3 and you deploy bare-bones Oracle-only RAC (e.g., no third party clusterware for fencing), you might take interest in this workaround since it could cause a failed fencing. That’s split-brain to you and me.

Just before the actual execution of the reboot(8) command, every Linux system running 10.2.0.3 will now suffer the overhead of the code starting at line 489 shown in the snippet below. The builtin test of the variable $PLATFORM is pretty much free, but if for any reason you are on RHEL 4, Novell SuSE SLES9 or even Oracle Enterprise Linux (who knows how they attribute versions to that), the code at line 491 is unnecessary and could put a full stop to the execution of this script if the server is in deep trouble—and remember, fencings are supposed to handle deeply troubled servers.

Fiddle First, Fence Later
Yes, the test at line 491 is a shell builtin, no argument, but as line 226 shows, the shell command at line 491 is checking for the existence of the file /var/tmp/.orarblock. I haven’t looked, but bash(1) is most likely calling open(2) with O_CREAT and O_EXCL and returning true on test -e if the open(2) call gets EEXIST returned and false if not. In the end, however, if checking for the existence of a file in /var/tmp is proving difficult at the time init.cssd is trying to “fence” a server, this code is pretty dangerous since it can cause a failed fencing on a Linux RAC deployment. Further, at line 494 the script will need to open a file and write to it. All this on a server that is presumed sick and needs to get out of the cluster. Then again, who is to say that the bash process executing the init.cssd script is not totally swapped out permanently due to extreme low memory thrashing? Remember, servers being told to fence themselves (ATONTRI) are not healthy. Anyway, here is the relevant snippet of 10.2.0.3 init.cssd:

226 REBOOTLOCKFILE=/var/tmp/.orarblock
[snip]
484 # Workaround to Redhat 3 issue with multiple invocations of reboot.
485 # Here if oclsomon and ocssd are attempting a reboot at the same time
486 # then the kernel could lock up. Here we have a crude lock which
487 # doesn’t eliminate but drastically reduces the likelihood of getting
488 # two reboots at once.
489 if [ "$PLATFORM" = "Linux" ]; then
490 CEDETO=
491 if [ -e "$REBOOTLOCKFILE" ]; then
492 CEDETO=`$CAT $REBOOTLOCKFILE`
493 fi
494 $ECHO $$ > $REBOOTLOCKFILE
495
496 if [ ! -z "$CEDETO" ]; then
497 REBOOT_CMD="$SLEEP 0"
498 $LOGMSG "Oracle init script ceding reboot to sibling $CEDETO."
499 fi
500 fi
501
502 $EVAL $REBOOT_CMD

