Archive for the 'Oracle Clusterware' Category

Using Oracle Clusterware for Non-RAC Purposes

In a recent post on the oracle-l list, a participant asked:

Hi, has anyone used 10.2 Clusterware with OCFS2 on RHEL5 to get single instance failover from one host to another?

My buddy Matt Zito (we’ve had beers before so we’re buddies) of GridApp followed up with:

I have a customer that does that – it apparently works very well […text deleted…]

However, the downside of CRS as single-instance is that both sides of the cluster need to be licensed for Oracle (as I understand the CRS license).

Licensing is always a topic for interesting conversation. To get to the bottom of this, I sent an email to the first Oracle person I ever heard pitch the idea of CRS for non-RAC purposes: Marshall Presser. Hmm, I think I can call him my old buddy too since we also had beers. Or then again, if I’m not mistaken, Marshall is an old Pyramid Technology guy and since I am an old Sequent Computer Systems guy, we are sort of long-lost cousins. Anyway, back to the topic. Marshall was nice enough to send me a very current reference for Oracle’s licensing terms for using CRS for non-RAC purposes with a quote from Oracle® Database Licensing Information 11g Release 1 (11.1) Part Number B28287-01:

Oracle Clusterware can be installed and used to protect any Oracle or third-party software provided any of the following conditions are met:

1. The software being protected is from Oracle.

2. The software being protected uses an Oracle Database.

3. The software being protected is running on Oracle Unbreakable Linux.

4. The software being protected is running in a cluster where at least one machine involved in the cluster is licensed using the appropriate metric for either Oracle Database Enterprise Edition or Oracle Database Standard Edition. A cluster is defined to include all the machines that share the same Oracle Cluster Registry (OCR) and Voting Disk

Unclear Clarity
So, as is usually the case with licensing, we have unclear clarity. And, yes, I know this is 11g information and the original query was about 10g, but it stands to reason that with some digging there would be a 10g equivalent. I wonder why criterion 1 above is stated. Since only one criterion needs to be met, I suppose we can interpret them as follows:

  • You can use CRS on Unbreakable Linux for anything you want (rule 3)
  • You can protect non-RAC Oracle databases on any platform (rule 1)
  • You can protect any software that connects to an Oracle database on any platform (rule 2)
  • You can protect anything on any cluster as long as one node in the cluster is running an instance of EE or SE (rule 4)

These are pretty liberal rules. I think Oracle is keen on widespread adoption of Oracle Clusterware for general purpose HA, but then I could be misreading the tea leaves.

What Does This Really Mean?
What we’re talking about here is using CRS to monitor (“check” in CRS parlance, “probe” in generic industry terms) an instance of Oracle and take action if the check fails. In general failover HA terms, probes (checks in CRS terms) fail as follows:

  1. The server is up but the database is down
  2. The server is down

In case 1 above, the HA engine will restart the database and in case 2 it will fail the database over to another server. The HA engine (in this case CRS) is smart enough to fail the service over to a system that is actually alive and has functional disk access and network interfaces. That is one of the roles of any HA clusterware (e.g., CRS, Steel Eye, VCS, Service Guard, HACMP, Red Hat Cluster Suite, PolyServe, etc).

Time Outs
The other way the HA engine will take action is if your probe (check script/program) seizes (times out). In that case, most HA engines will execute a “restart” action, which is generally a stop action followed by a start action and another probe (check). This is not an endless loop though. Most HA engines have a tunable max for retries (restart attempts in CRS), after which they will fail over to the defined backup server. Be aware, however, that a seized service (such as a non-RAC database instance) could be so locked up it didn’t stop when the HA engine tried its restart action. In that case, you have Oracle processes with files open. If you fail over to a server that accesses the database on a shared filesystem such as NFS or OCFS, you have some things to be concerned about. You won’t be able to start the instance until the $ORACLE_HOME/dbs/lk${ORACLE_SID} file is removed, but simply removing it still leaves that other catatonic instance up on the ill server. These solutions can become complex.

The topic of what probe (check) actions are appropriate is the subject matter of very long, drawn-out discussions rife with theory and prejudice. I’ve been there and I’ve done that. I bet most folks that use CRS to start/stop and check non-RAC databases will likely use the script interface. Note, as with all HA engines out there, you can write a C probe (or CRS action program) because all the engine is looking for is a return code (success/failure).
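To make the script interface concrete, here is a minimal sketch of an action script. It assumes the engine calls the script with a single argument (start, stop or check) and looks only at the exit status, with 0 meaning success. The three helper functions are hypothetical placeholders that only echo what they would do; real startup, shutdown and probe logic would go in their place.

```shell
#!/bin/sh
# Sketch of an action script for a non-RAC instance.
# Assumption: the HA engine passes one argument (start|stop|check)
# and treats exit status 0 as success, anything else as failure.

# Hypothetical helpers; replace the bodies with real logic.
start_instance() { echo "starting instance"; }
stop_instance()  { echo "stopping instance"; }
check_instance() { echo "checking instance"; }

# action dispatches on the requested entry point and returns its status.
action() {
    case "$1" in
        start) start_instance ;;
        stop)  stop_instance ;;
        check) check_instance ;;
        *)     echo "usage: action {start|stop|check}" >&2; return 2 ;;
    esac
}
```

In a real script the last line would be something like `action "$1"` so that the script's exit status is whatever the chosen entry point returned.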

I think the most clever probe action I’ve heard to date came from fellow OakTable Network member Tim Gorman. Tim once suggested that a great probe action would be to make a purposeful failed attempt to connect such as:

$ sqlplus foo/bar <<EOF 2>&1 | grep 1017
> REM There is no user called foo...expect ORA 1017
> exit;
> EOF
ORA-01017: invalid username/password; logon denied
$ echo $?
0

If you get anything other than ORA-01017, something is ill. In this case, a success for grep(1) is a success for the probe/check. That is, if grep(1) gets its text, the server returned ORA-01017 and thus the instance was well enough to perform the functionality of user authentication. Your check script would get this in grep(1)’s return code ($?).

Trying to connect as a bogus user actually tests quite a bit of server functionality (SQL parsing, user authentication and so forth). I think this may actually create a temporary session as well. It certainly tests the server’s ability to fork(2) sqlplus and exec(2) $ORACLE_HOME/bin/oracle so you are testing the OS VM, process slots, etc.  All in all, it is a very clever probe (check action). If you wanted to use CRS to check both the health of SQL*Net and a non-RAC database instance, then you could do this same bogus connect attempt through the listener. If the listener is down, you’ll get the appropriate error text. Then again, if you wanted to make a heavy probe/check, you could connect as an application user and update a dummy row in a table or something like that. The sky is the limit with this sort of HA kit.
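Here is a rough sketch of how the bogus-login probe could be wrapped for a check action. It assumes sqlplus is on the PATH with ORACLE_HOME and ORACLE_SID already set, and that foo/bar really is a nonexistent account. The function names classify_probe and run_probe are made up for this example.

```shell
#!/bin/sh
# Sketch of a check action built on the purposeful failed login.
# Assumptions: sqlplus is on PATH, the Oracle environment is set,
# and foo/bar is a deliberately nonexistent account.

# classify_probe reads probe output on stdin and succeeds only
# if the expected ORA-01017 text is present.
classify_probe() {
    grep -q 'ORA-01017'
}

# run_probe attempts the bogus connect; the pipeline's exit status
# is that of classify_probe, which is what a check action returns.
run_probe() {
    sqlplus foo/bar <<EOF 2>&1 | classify_probe
exit;
EOF
}
```

Keeping the classification separate from the connect attempt makes it easy to try the logic against captured output before pointing it at a live instance.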

Additional Material
Oracle has more information in the form of whitepapers:

Comparing and Linux RAC Fencing. Also, Fencing Failures (Split Brain).

BLOG UPDATE 2011.08.11 : For years my criticism of Oracle Clusterware fencing methodology brought ire from many who were convinced I was merely a renegade. The ranks of “the many” in this case were generally well-intended but overly convinced that Oracle was the only proven clustering technology in existence.  It took many years for Oracle to do so, but they did finally offer support for IPMI fencing integration in the 11.2 release of Oracle Database. It also took me a long time to get around to updating this post.  Whether by graces of capitulation or a reinvention of the wheel, you too can now, finally, enjoy a proper fencing infrastructure. For more information please see:

I’ve covered the clusters concept of fencing quite a bit on this blog (e.g., RAC Expert or Clusters Expert and Now is the Time to Open Source, etc), and in papers such as this paper about clusterware, and in an appendix in the Julian Dyke/Steve Shaw book about RAC on Linux. If I’ve said it once, I’ve said it 1000 times; if you are not a clusters expert you cannot be a RAC expert. Oddly though, Oracle seems to be sending a message that clusterware is commoditized—and it really isn’t. On the other hand, Oracle was brilliant for heading down the road of providing their own clusterware. Until all the kinks are worked out, it is good to know as much as you can about what is under the covers.

Linux RAC “Fencing”
As I’ve pointed out in the above referenced pieces, Oracle “fencing” is not implemented by healthy servers taking action against rogue servers (e.g., STONITH), but instead the server that needs to be “fenced” is sent a message. With that message, the sick server will then reboot itself. Of course, a sick server might not be able to reboot itself. I call this form of fencing ATONTRI (Ask The Other Node To Reboot Itself). This blog entry is not intended to bash Oracle clusterware “fencing”. It is what it is, works well and for those who choose there is the option of running integrated Legacy clusterware or validated third party clusterware to fill in the gaps. Instead, I want to blog about a couple of interesting observations and then cover some changes that were implemented to the Oracle init.cssd script under that you need to be aware of.

Logging When Oracle “Fences” a Server
As I mentioned in this blog entry about the CRS patchset, I found CRS—or is that “clusterware”—to be sufficiently stable to just skip over So what I’m about to point out might be old news to you folks. The logging text produced by Oracle clusterware changed between and But, since CRS has a fundamental flaw in the way it logs this text, you’d likely never know it.

Lots of Looking Going On
As an aside, one of the cool things about blogging is that I get to track the search terms folks use to get here. Since the launch of my blog, I’ve had over 11000 visits from readers looking for information about the most common error message returned if you have a botched CRS install on Linux, that text being:

PROT-1: Failed to initialize ocrconfig

No News Must Be Good News
I haven’t yet blogged about the /var/log/messages entry you are supposed to see when Oracle fences a server, but if I had, I don’t think it would be a very common Google search string anyway. No, the reason isn’t that Oracle so seldom needs to fence a server. The reason is that the text generally (nearly never, actually) doesn’t make it into the system log. Let’s dig into this topic.

The portion of the init.cssd script that acts as the “fencing” agent in is coded to produce the following entry in the /var/log/messages file via the Linux logger(1) command (line numbers precede code):

194 LOGGER="/usr/bin/logger"
1039 *)
1040 $LOGERR "Oracle CSSD failure. Rebooting for cluster integrity."
1042 # We want to reboot here as fast as possible. It is imperative
1043 # that we do not flush any IO to the shared disks. Choosing not
1044 # to flush local disks or kill off processes gracefully shuts
1045 # us down quickly.

Let’s think about this for a moment. If Oracle needs to “fence” a server, the server that is being fenced should produce the following text in /var/log/messages:

Oracle CSSD failure. Rebooting for cluster integrity.

Where’s Waldo?
Why is it when I google for “Oracle CSSD failure. Rebooting for cluster integrity” I get 3, count them, 3 articles returned? Maybe the logger(1) command simply doesn’t work? Let’s give that a quick test:

[root@tmr6s14 log]# logger "I seem to be able to get messages to the log"
[root@tmr6s14 log]# tail -1 /var/log/messages
Jan 9 15:16:33 tmr6s14 root: I seem to be able to get messages to the log
[root@tmr6s14 log]# uname -a
Linux tmr6s14 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux

Interesting. Why don’t we see the string Oracle CSSD failure when Oracle fences then? It’s because the logger(1) command merely sends a message to syslogd(8) via a socket—and then it is off to the races. Again, back to the init.cssd script:

22 # FAST_REBOOT – take out the machine now. We are concerned about
23 # data integrity since the other node has evicted us.
[…] lines deleted
177 case $PLATFORM in
179 export LD_LIBRARY_PATH
180 FAST_REBOOT="/sbin/reboot -n -f"

So at line 1040, the script sends a message to syslogd(8) and then immediately forces a reboot at line 1081, with the -n option to the reboot(8) command forcing a shutdown without sync(1). So there you have it, the text is drifting between the bash(1) context executing the init.cssd script and the syslogd(8) process that would do a buffered write anyway. I think the planets must really be in line for this text to ever get to the /var/log/messages file, and I think the google search for that particular string goes a long way towards backing up that notion. When I really want to see this string pop up in /var/log/messages, I fiddle with putting sync(1) commands and sleep before line 1081. That is when I am, for instance, pulling physical connections from the Fibre Channel SAN paths and studying what Oracle behaves like by default.
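The kind of fiddling I mean can be sketched roughly as follows. This is not the init.cssd code; DRY_RUN, FLUSH_WAIT and the function names are hypothetical, and the sketch defaults to a dry run so it only prints what it would do. Even with the sleep and sync(1) in place, getting the text to disk is still a race, just a more forgiving one.

```shell
#!/bin/sh
# Sketch: give syslogd a chance to write the message before a forced reboot.
# DRY_RUN and FLUSH_WAIT are hypothetical knobs for this example only.
DRY_RUN=${DRY_RUN:-1}        # 1 = print actions instead of performing them
FLUSH_WAIT=${FLUSH_WAIT:-2}  # seconds to wait for syslogd to catch up

# emit sends the message to syslog, or prints it in dry-run mode.
emit() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "log: $1"
    else
        /usr/bin/logger "$1"
    fi
}

log_then_reboot() {
    emit "Oracle CSSD failure. Rebooting for cluster integrity."
    sleep "$FLUSH_WAIT"   # let syslogd read the message from its socket
    sync                  # flush buffered file data, including the log file
    if [ "$DRY_RUN" = 1 ]; then
        echo "dry run: would execute /sbin/reboot -n -f"
    else
        /sbin/reboot -n -f
    fi
}
```

Leave DRY_RUN at 1 anywhere but a test server you intend to reboot.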

By the way, the comments at lines 22-23 are the definition of ATONTRI.

I’ve never understood that paranoia at lines 1042-1043 which state:

We want to reboot here as fast as possible. It is imperative that we do not flush any IO to the shared disks.

It may sound a bit nit-picky, but folks this is RAC and there are no buffered writes to shared disk! No matter really, even if there were a sync(1) command at line 1080 in the init.cssd script, getting text to /var/log/messages would still be a race, as I’ve pointed out.

Differences in
Google searches for fencing articles anchored with the Oracle CSSD failure string are about to get even more scarce. In, the text that the script attempts to send to the /var/log/messages file changed—the string no longer contains CSSD, but CRS instead. The following is a snippet from the init.cssd script shipped with

452 *)
453 $LOGERR "Oracle CRS failure. Rebooting for cluster integrity."

A Workaround for a Red Hat 3 Problem in CRS
OK, this is interesting. In the init.cssd script, there is a workaround for some RHEL 3 race condition. I would be more specific about this, but I really don’t care about any problems init.cssd has in its attempt to perform fencing since for me the whole issue is moot. PolyServe is running underneath it and PolyServe is not going to fail a fencing operation. Nonetheless, if you are not on RHEL 3, and you deploy bare-bones Oracle-only RAC (e.g., no third party clusterware for fencing), you might take interest in this workaround since it could cause a failed fencing. That’s split-brain to you and me.

Just before the actual execution of the reboot(8) command, every Linux system running will now suffer the overhead of the code starting at line 489 shown in the snippet below. The builtin test of the variable $PLATFORM is pretty much free, but if for any reason you are on RHEL 4, Novell SuSE SLES9 or even Oracle Enterprise Linux (who knows how they attribute versions to that) the code at line 491 is unnecessary and could put a full stop to the execution of this script if the server is in deep trouble. And remember, fencings are supposed to handle deeply troubled servers.

Fiddle First, Fence Later
Yes, the test at line 491 is a shell builtin, no argument, but as line 226 shows, the shell command at line 491 is checking for the existence of the file /var/tmp/.orarblock. I haven’t looked, but bash(1) is most likely calling stat(2) on that path, with test -e returning true if the call succeeds and false if it fails (with ENOENT, for instance). In the end, however, if checking for the existence of a file in /var/tmp is proving difficult at the time init.cssd is trying to “fence” a server, this code is pretty dangerous since it can cause a failed fencing on a Linux RAC deployment. Further, at line 494 the script will need to open a file and write to it. All this on a server that is presumed sick and needs to get out of the cluster. Then again, who is to say that the bash process executing the init.cssd script is not totally swapped out permanently due to extreme low memory thrashing? Remember, servers being told to fence themselves (ATONTRI) are not healthy. Anyway, here is the relevant snippet of init.cssd:

226 REBOOTLOCKFILE=/var/tmp/.orarblock
484 # Workaround to Redhat 3 issue with multiple invocations of reboot.
485 # Here if oclsomon and ocssd are attempting a reboot at the same time
486 # then the kernel could lock up. Here we have a crude lock which
487 # doesn’t eliminate but drastically reduces the likelihood of getting
488 # two reboots at once.
489 if [ "$PLATFORM" = "Linux" ]; then
491 if [ -e "$REBOOTLOCKFILE" ]; then
493 fi
496 if [ ! -z "$CEDETO" ]; then
498 $LOGMSG "Oracle init script ceding reboot to sibling $CEDETO."
499 fi
500 fi
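For comparison, here is a sketch of taking a lock file in a single step instead of testing for it and then creating it. It relies on the shell's noclobber option (set -C), and the LOCKFILE path and try_lock name are made up for this example. It does nothing about my larger concern: any file I/O on a sick server can still hang.

```shell
#!/bin/sh
# Sketch: take a lock file in one step rather than test-then-create.
# LOCKFILE is a hypothetical path used only for this example.
LOCKFILE=${LOCKFILE:-/tmp/.example_reboot_lock.$$}

# try_lock returns 0 if this caller created the lock file and
# non-zero if the file was already there.
try_lock() {
    # With noclobber set, ">" fails when the target already exists.
    # The subshell keeps the option from leaking into the caller.
    ( set -C; echo "$$" > "$LOCKFILE" ) 2>/dev/null
}
```

The first caller gets a zero status and owns the lock; any later caller gets a non-zero status and can cede.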

The “Dread Factor”, Multi-vendor Support, Unbreakable Linux.

Dread the Possible, Ignore the Probable
“One throat to choke”, is the phrase I heard the last time I spoke with someone who went to extremes to reduce the number of technology providers in their production Oracle deployment. You know, Unbreakable Linux, single-source support provider, etc. I’m sorry, but, if you are running Oracle on Linux there is no way to get single-provider support. We all find this out sooner or later. Sure, you can send your money to a sole entity, but that is just a placebo. If I thought my life depended on single-provider support, I’d buy an IBM System i solution (AS400)—soup to nuts. At least I’d get close.

With Linux there are always going to be multiple providers because it runs on commodity hardware. You then add storage (SAN array, switches, HBAs), load the OS and Oracle and other software. There you go: multiple providers. So why is it that people sometimes take comfort in this theory of single-provider support on the software (OS and Oracle only, of course) side of things? Is it a reality?

Dread Factor
No, single-provider support with Oracle on Linux is not a reality. That is why serious software providers and their careful customers rely on TSANet to ensure all parties play by the rules and do not start pointing fingers at the expense of the customer. Oracle is a participant in TSANet, so is PolyServe.

I was reading an interesting magazine article—also available online—about how we humans fear the wrong things. You know, things like fearing a commercial airliner fatality more than an auto fatality—the latter taking 500-fold more lives per year. The article explains why. We dread an airliner crash more. The article points out:

[…] the more we dread, the more anxious we get, and the more anxious we get, the less precisely we calculate the odds of the thing actually happening. “It’s called probability neglect,”

What Does This Have To Do With Oracle?
Well, we fear how “helpless” we might be in a case where the OS or third party platform software provider is pointing at Oracle and Oracle is pointing back. By the way, have you ever finger-pointed at an 800lb gorilla? Yes, that is a possible scenario. Is that somehow more calamitous than working with Oracle on a clear, concise Oracle-only bug (e.g., some ORA-00600 crash problem)? Probably not, but fear of the former is an example of what the magazine article calls the Dread Factor.

New Year’s Resolution: Fear the Probable
We have a Wall Street customer that does not use our Database Utility for Oracle RAC in their RAC solution but does use our scalable file serving in their ETL workflow. They run Oracle on Itanium Linux and we don’t do Itanium. But, since we are in there, I know a bit about their operations. In the month of November 2006, one of their operations managers told me they had nearly 90 Oracle TARs open, half of which were ORA-00600/ORA-07445 problems. All those TARs were affecting a single application, a single RAC database. Yes, it is conceivable that they have also faced a multi-vendor problem (e.g., HBA firmware/Red Hat SCSI midlayer) at some point in this deployment. Do you think they really care? In this shop, the database tier is 100% Unbreakable Linux, the old style, not the new style. The old style Unbreakable Linux being RHEL with Oracle and no third-party kernel loadable modules. That’s them: they have a “single throat to choke”. How do you think that is working out for them? It hasn’t made a bit of difference.

Oracle is an awesome database. It is huge and complex. You are going to hit bugs so it might be a good New Year’s resolution to fear the probable more than the possible. Get the most stable, manageable, supported configuration you can so that you are not dealing with day-to-day headaches between those probable bugs. That is, don’t hinge your deployment on some possible support finger-pointing match. Real, difficult, single-vendor bugs are most probable. Choose your partners well for those possible bugs.

A Case Study
The majority of the suse-oracle email list participants have the “no-third-party” model deployed. They are, if you will, the poster children for Unbreakable Linux. So I keep an eye out there to see how the theory plays out in reality. Let’s take a peek. In a recent thread about an Asynchronous I/O problem in the Linux kernel, the poster wrote:

We already tried this…opened a TAR with Oracle, opened an issue with Novell…got 2 fixes from Novell, but both are not helping around the bug. The database crashes after approx. 1 week of heavy load and you have to restart the machine to free the ipc-resources.

Remember that with an Unbreakable Linux deployment, if you hit a Linux kernel problem you can call Oracle or the provider of your Linux distribution. This person tried both, but the saga continued:

[…] we filed a bug…with both parties, Novell AND Oracle.We escalated this case at Novell, because it’s a kernel bug…no change for the last 4-6 weeks. But…as you see…no solution after about 3 months…

Since Linux is open source, the code is open to all for reading. I’ve blogged before about the dubious value in being able to read the source for the OS or layers such as clustered filesystems since an IT shop is not likely to fix the problem themselves anyway. The customer having this async I/O problem took advantage of that “benefit”:

I took a deep look into the kernel-code, especially the part of the bug in aio.c As far as i see, it looks like a list-corruption of the list of outstanding io-requests. So i don’t think that it is driver-specific…it looks like a general bug.

But, as I routinely point out, having the source really doesn’t help an IT shop much as this installment on the thread shows:

It’s very unfortunate that this bug (bz #165140) is still not resolved
as both Oracle and SUSE eng. teams are looking into problem.

An Historical Example of Good Multi-Vendor Support
Back in the 1990’s Veritas, Oracle and Sun got together to build a program called VOS to ensure their joint customers get the handling they deserve. Kudos to Oracle and Sun. That was typical of Oracle back in the Open Systems days. Things were a lot more “open” back then.

I participate in the oracle-l list. There was a recent thread there about the dreaded “finger-pointing” illusion. In this post a list participant set the record straight. His post points out that having more than “one throat to choke” is better than being all alone:

In the context of clustering, even if you eliminate the third-party cluster-ware products, you still have the other pieces of the pie, like the OS, the storage (SAN, etc.), the interconnect, etc., so the finger-pointing will not go away. I have worked with the VOS support many times in the past and I can tell you that in each conference call, VERITAS support never pointed fingers towards anyone. In fact, their support people were so competent that they even identified issues that were related to SAN and even the analysts from the storage SAN company were not able to identify them.

Lessons From Real Life
Multi-vendor support is a phenomenon across all industries. A good friend of mine has a real job and does real work for a living—dangerous work, with huge dangerous equipment that he owns. He knows that there are certain things he has to do with his machinery that substantially increase the probability of something going wrong. In those cases, he doesn’t fret about the possibility that there may be some political outcome. He focuses on the probable.

A bit over a year ago he experienced “the probable” and took photos for me. While moving a 60,000+ lb piece of machinery, he hit a patch of ice and yes, 30 ton track vehicles do slide on ice just like your co-worker’s red sports car.

In the following shot, the machinery had just slipped off the road so he called in another of his pieces to help.


In the next shot they had worked at the problem until the tracks were headed in the right direction and the tether was freshly cut loose. He said the anxiety was so thick you could cut it with a knife. It is quite probable he is right. Then again, it is possible he was exaggerating. I’ll let you be the judge.

I’ll blog another time about where that machine had to go after that photo…it wasn’t pretty.

Partition, or Real Application Clusters Will Not Work.

OK, that was a come-on title. I’ll admit it straight away. You might find this post interesting nonetheless. Some time back, Christo Kutrovsky made a blog entry on the Pythian site about buffer cache analysis for RAC. I meant to blog about the post, but never got around to it—until today.

Christo’s entry consisted of some RAC theory and a buffer cache contents SQL query. I admit I have not yet tested his script against any of my RAC databases. I intend to do so soon, but I can’t right now because they are all under test. However, I wanted to comment a bit on Christo’s take on RAC theory. But first I’d like to comment about a statement in Christo’s post. He wrote:

There’s a caveat however. You have to first put your application in RAC, then the query can tell you how well it runs.

Not that Christo is saying so, but please don’t get into the habit of using scripts against internal performance tables as a metric of how “well” things are running. Such scripts should be used as tools to approach a known performance problem, a problem measured much closer to the user of the application. There are too many DBAs out there that run scripts way down-wind of the application and if they see such metrics as high hit ratios in cache, or other such metrics, they rest on their laurels. That is bad mojo. It is not entirely unlikely that even a script like Christo’s could give a very “bad reading” yet application performance is satisfactory, and vice versa. OK, enough said.

Application Partitioning with RAC
The basic premise Christo was trying to get across is that RAC works best when applications accessing the instances are partitioned in such a way as to not require cross-instance data shipping. Of course that is true, but what lengths do you really have to go to in order to get your money’s worth out of RAC? That is, we all recall how horrible block pings were with OPS—or do we? See, most people that loathed the dreaded block ping in OPS thought that the poison was in the disk I/O component of a ping when in reality the poison was in the IPC (both inter and intra instance IPC). OK, what am I talking about? It was quite common for a block ping in OPS to take on the order of 200-250 milliseconds on a system where disk I/O is being serviced with respectable times like 10ms. Where did the time go? IPC.

Remembering the Ping
In OPS, when a shadow process needed a block from another instance, there was an astounding amount of IPC involved to get the block from one instance to the other. In quick and dirty terms (this is just a brief overview of the life of a block ping) it consisted of the shadow process requesting the local LCK process to communicate with the remote LCK process, which in turn communicated with the DBWR process on that node. That DBWR process then flushed the required block (along with all the modified blocks covered by the same PCM lock) to disk. That DBWR then posted its local LCK, which in turn posted the LCK process back where the original requesting shadow process was waiting. That LCK then posted the shadow process and the shadow process then read the block from disk. Whew. Note, at every IPC point the act of messaging only makes the process being posted runnable. It then waits in line for CPU in accordance with its mode and priority. Also, when DBWR is posted on the holding node, it is unlikely that it was idle, so the life of the block ping event also included some amount of time that was spent while DBWR finished servicing the SGA flushing it was already doing when it got posted. All told, there was quite often some 20 points where the processes involved were in runnable states. Considering the time quantum for scheduling is/was 10ms, you routinely got as much as 200ms overhead on a block ping that was just scheduling delay. What a drag.
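The arithmetic is simple enough to sketch as a back-of-envelope estimate. The function name is made up, and the model assumes a full quantum of waiting at each runnable point plus exactly two disk I/Os (one write by DBWR, one read by the shadow process). That two-I/O count is my simplification for the sketch, not a measurement.

```shell
#!/bin/sh
# Back-of-envelope estimate of OPS block ping latency in milliseconds.
# ping_estimate_ms POINTS QUANTUM_MS IO_MS
#   POINTS     - times a process becomes runnable and waits for CPU
#   QUANTUM_MS - scheduler quantum, taken as the worst-case wait per point
#   IO_MS      - service time of one disk I/O
# Assumption: one write by DBWR plus one read by the shadow process.
ping_estimate_ms() {
    echo $(( $1 * $2 + 2 * $3 ))
}

# 20 runnable points, 10ms quantum, 10ms disk I/O
ping_estimate_ms 20 10 10
```

With those inputs the estimate is 220ms, of which only 20ms is disk I/O.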

What Does This Have To Do With RAC?
Christo’s post discusses divide and conquer style RAC partitioning, and he is right. If you want RAC to perform perfectly for you, you have to make sure that RAC isn’t being used. Oh he’s gone off the deep end again you say. No, not really. What I’m saying is that if you completely partition your workload then RAC is indeed not really being used. I’m not saying Christo is suggesting you have to do that. I am saying, however, you don’t have to do that. This blog post is not just a shill for Cache Fusion, but folks, we are not talking about block pings here. Cache Fusion, even over Gigabit Ethernet, is actually quite efficient. Applications can scale fairly well with RAC without going to extreme partitioning efforts. I think the best message is that application partitioning should be looked at as a method of exploiting this exorbitantly priced stuff you bought. That is, in the same way we try to exploit the efficiencies gained by fundamental SMP cache-affinity principles, so should attempts be made to localize demand for tables and indexes (and other objects) to instances, when feasible. If it is not feasible to do any application partitioning, and RAC isn’t scaling for you, you have to get a bigger SMP. Sorry. How often do I see that? Strangely not that often. Why?

I can’t count how often I see production RAC instances running throughout an entire RAC cluster at processor utilization levels well below 50%. And I’m talking about RAC deployments where no attempt has been made to partition the application. These sites often don’t need to consider such deployment tactics because the performance they are getting is meeting their requirements. I do cringe and bite my tongue however when I see 2 instances of RAC in a two node cluster—void of any application partitioning—running at, say, 40% processor utilization on each node. If no partitioning effort has been made, that means there is cache fusion (GCS/GES) in play—and lots of it. Deployments like that are turning their GbE Cache Fusion interconnect into an extension of the system bus if you will. If I was the administrator of such a setup, I’d ask Santa to scramble down the chimney and pack that entire workload into one server at roughly 80% utilization. But that’s just me. Oh, actually, packing two 40% RAC workloads back into a single server doesn’t necessarily produce 80% utilization. There is more to it than that. I’ll see if I can blog about that one too at some point.

What about High-Speed, Low-Latency Interconnects?
With OLTP, if the processors are saturated on the RAC instances you are trying to scale, high-speed/low latency interconnect will not buy you a thing. Sorry. I’ll blog about why in another post.

Final Thought
If you are one of the few out there that find yourself facing a total partitioning exercise with RAC, why not deploy a larger SMP instead? Comments?

Testing RAC Failover: Be Evil, Make New Friends.

In Alejandro Vargas’ blog entry about RAC & ASM, Crash and Recovery Test Scenarios, some tests were described that would cause RAC failovers. Unfortunately, none of the faults described were of the sort that put clusterware to the test. The easiest types of failures for clusterware to handle are complete, clean outages. Simply powering off a server, for instance, is no challenge for any clusterware to deal with. The other nodes in the cluster will be well aware that the node is dead. The difficult scenarios for clusterware to respond to are states of flux and compromised participation in the cluster. That is, a server that is alive but not participating. The topic of Alejandro’s blog entry was not a definition of a production readiness testing plan by any means, but it was a good segue into the comment I entered:

These are good tests, yes, but they do not truly replicate difficult scenarios for clusterware to resolve. It is always important to perform manual fault-injection testing such as physically severing storage and network connectivity paths and doing so with simultaneous failures and cascading failures alike. Also, another good test to [run] is a forced processor starvation situation by forking processes in a loop until there are no [process] slots [remaining]. These […] situations are a challenge to any clusterware offering.

Clusterware is Serious Business
As I pointed out in my previous blog entry about Oracle Clusterware, processor saturation is a bad thing for Oracle Clusterware—particularly where fencing is concerned. Alejandro had this to say:

These scenarios were defined to train a group of DBA’s to perform recovery, rather than to test the clusterware itself. When we introduced RAC & ASM we did run stress & resilience tests. The starvation test you suggest is a good one, I have seen that happening at customer sites on production environments. Thanks for your comments.

Be Mean!
If you are involved with a pre-production testing effort involving clustered Oracle, remember, be evil! Don’t force failover by doing operational things like shutting down a server or killing Oracle clusterware processes. You are just doing a functional test when you do that. Instead, create significant server load with synthetic tests such as wild loops of dd(1) to /dev/null using absurdly large values assigned to the ibs argument or shell scripts that fork children but don’t wait for them. Run C programs that wildly malloc(2) memory, or maybe a little stack recursion is your flavor—force the system into swapping, etc. Generate these loads on the server you are about to isolate from the network for instance. See what the state of the cluster is afterwards. Of course, you can purposefully execute poorly tuned Parallel Query workloads to swamp a system as well. Be creative.

Something To Think About
For once, it will pay off to be evil. Just make sure whatever you accept as your synthetic load generator is consistent and reproducible because once you start this testing, you’ll be doing it again and again—if you find bugs. You’ll be spending a lot of time on the phone making new friends.
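To make the "consistent and reproducible" point concrete, here is a minimal sketch of a seeded load plan in Python. It is far tamer than the dd(1) loops and runaway forking described above, and every name in it (build_load_plan, burn_cpu, touch_memory, run_plan) is made up for illustration, as are the step sizes and the 4096-byte page size. The idea is only that the same seed always yields the same sequence of abuse, so a failure can be replayed for whoever ends up on the phone with you.

```python
import random
import time


def build_load_plan(seed, steps):
    """Return a deterministic list of (kind, amount) load steps for a given seed."""
    rng = random.Random(seed)  # private generator, so the plan depends only on the seed
    plan = []
    for _ in range(steps):
        kind = rng.choice(["cpu", "memory"])
        if kind == "cpu":
            plan.append((kind, rng.randint(1, 5)))     # seconds of busy looping
        else:
            plan.append((kind, rng.randint(16, 256)))  # MiB to allocate and touch
    return plan


def burn_cpu(seconds):
    """Spin in a tight loop for roughly `seconds`; returns the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def touch_memory(mib, page_size=4096):
    """Allocate `mib` MiB and write one byte per assumed page so the pages are resident."""
    buf = bytearray(mib * 1024 * 1024)
    for offset in range(0, len(buf), page_size):
        buf[offset] = 1
    return buf


def run_plan(plan):
    """Execute each step in order, holding every allocation until the plan finishes."""
    held = []
    for kind, amount in plan:
        if kind == "cpu":
            burn_cpu(amount)
        else:
            held.append(touch_memory(amount))
    return len(held)
```

One copy of run_plan(build_load_plan(seed, steps)) will not swamp a server. In practice you would launch many copies at once on the node you are about to isolate, and record the seed alongside the test results.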

RAC Expert or Clusters Expert?

Introducing the Oracle SMP Expert. What is a Spinlock?
I am not joking when I tell you that I met an individual last year who billed himself as an “Oracle SMP expert.” That is fine and dandy, but through the course of our discussion I realized that this person had a severely limited understanding of the most crucial concept in SMP software scalability: critical sections. It wasn’t so much the concept of critical sections that this individual didn’t understand; it was the mutual exclusion that must accompany critical sections on SMP systems. In Oracle terms, this person could not deliver a coherent definition of what a latch is. That is, he didn’t understand what a spinlock was and how Oracle implements them. An “Oracle SMP expert” who lacks even a cursory understanding of mutual exclusion principles is an awful lot like a “RAC expert” who does not have a firm understanding of what the term “fencing” means.
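For readers who want to see the idea rather than take my word for it, here is a toy spinlock guarding a critical section, sketched in Python. This is not how Oracle implements latches. A real spinlock is built on an atomic hardware instruction, and here a non-blocking threading.Lock.acquire merely stands in for that test-and-set. The class name SpinLock, the spin_count and sleep_seconds values, and the run_workers helper are all assumptions chosen for illustration.

```python
import threading
import time


class SpinLock:
    """Toy spinlock: spin on a non-blocking try-acquire, then back off briefly."""

    def __init__(self, spin_count=2000, sleep_seconds=0.0001):
        self._flag = threading.Lock()  # used only as an atomic test-and-set flag
        self.spin_count = spin_count
        self.sleep_seconds = sleep_seconds

    def acquire(self):
        while True:
            for _ in range(self.spin_count):
                if self._flag.acquire(blocking=False):  # atomic "test and set"
                    return
            time.sleep(self.sleep_seconds)  # spun too long: give up the processor

    def release(self):
        self._flag.release()


def run_workers(threads=4, increments=1000):
    """Have several threads bump one shared counter, each inside the critical section."""
    lock = SpinLock()
    counter = {"value": 0}

    def worker():
        for _ in range(increments):
            lock.acquire()
            try:
                counter["value"] += 1  # the critical section
            finally:
                lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter["value"]
```

With the lock in place, run_workers(4, 1000) returns 4000 every time, because only one thread at a time can be inside the increment. The mutual exclusion is the whole point; the spinning is just how a waiter passes the time.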

I have met a lot of “RAC experts” in the last 5 years who lack understanding of clusters principles—most notably what the term “fencing” is and what it means to RAC. Fencing is to clusters what critical sections are to SMP scalability.

Is it possible to be a “RAC expert” without being a cluster expert? The following is a digest of this paper about clusterware I have posted on the Oaktable Network website. For that matter, Julian Dyke and Steve Shaw accepted some of this information for inclusion in this RAC book.

Actually, I think getting it in their book was a part of the bribe for the technical review I did of the book (just joking).

I Adore RAC and Fencing is a Cool Sport!
No, not that kind of fencing. Fencing is a generic clustering term relating to how a cluster handles nodes that should no longer have access to shared resources such as shared disk. For example, if a node in the cluster has access to shared disk but has no functioning interconnects, it really no longer belongs in the cluster. There are several different types of fencing. The most common type came from academia and is referred to by the acronym STOMITH, which stands for Shoot The Other Machine In The Head. A more popular variant of this acronym is STONITH, where “N” stands for Node. While STONITH is a common term, there is nothing common about how it is implemented. The general idea is that the healthy nodes in the cluster are responsible for determining that an unhealthy node should no longer be in the cluster. Once such a determination is made, a healthy node takes action to power cycle the errant node. This can be done with network power switches, for example. All told, STONITH is a “good” approach to fencing because it is generally built upon the notion that healthy nodes monitor and take action to fence unhealthy nodes.
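The general STONITH idea can be sketched in a few lines of Python. This is a simplification under stated assumptions, not any vendor's implementation: health is reduced to a heartbeat timeout, the fencer is simply the lowest-named healthy node, and power_cycle is a hypothetical callback standing in for something like a network power switch. Quorum and split-brain handling, which real clusterware cannot skip, are left out entirely.

```python
def find_unhealthy(last_heartbeat, now, timeout):
    """Return nodes whose last heartbeat is older than `timeout` seconds."""
    return sorted(node for node, seen in last_heartbeat.items() if now - seen > timeout)


def stonith(last_heartbeat, now, timeout, power_cycle):
    """Have the healthy side of the cluster power-cycle every unhealthy node.

    `power_cycle(fencer, target)` is a hypothetical callback, not a real API.
    """
    unhealthy = find_unhealthy(last_heartbeat, now, timeout)
    healthy = sorted(set(last_heartbeat) - set(unhealthy))
    if not healthy:
        return []  # nobody is left in a position to fence anyone
    fencer = healthy[0]  # simplistic choice: the lowest-named healthy node acts
    fenced = []
    for node in unhealthy:
        power_cycle(fencer, node)
        fenced.append(node)
    return fenced
```

Note who makes the decision and who pulls the trigger: in both cases it is a healthy node, never the node being fenced.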

This differs significantly from the “fencing” model implemented in Oracle Clusterware, which doesn’t implement STONITH at all. In Oracle Clusterware, nodes fence themselves by executing the reboot(8) command out of the /etc/init.d/init.cssd script. This is a very portable approach to “fencing”, but it raises the question of what happens if the node is so unhealthy that it cannot successfully execute the reboot(8) command. Certainly we’ve all experienced systems that were so incapacitated that commands no longer executed (e.g., complete virtual memory depletion). In a cluster it is imperative that nodes be fenced when needed, otherwise they can corrupt data. After all, there is a reason the node is being fenced. Having a node with active I/O paths to shared storage after it is supposed to be fenced from the cluster is not a good thing.

Oracle Clusterware and Vendor Clusterware in Parallel
On all platforms except Linux and Windows, Oracle Clusterware can execute in an integrated fashion with the host clusterware. An example of this would be Oracle10g using the libskgx[n/p] libraries supplied by HP for the MC ServiceGuard environment. When Oracle runs with integrated vendor clusterware, Oracle makes calls to the vendor-supplied library to perform fencing operations. This blog post is about Linux, where the only relationship between vendor clusterware and Oracle Clusterware arises when Oracle-validated compatible clusterware runs in parallel with Oracle Clusterware. One such example of this model is Oracle10g RAC on PolyServe Matrix Server.

In situations where Oracle’s fencing mechanism is not able to perform its fencing operation, the underlying validated host clusterware will fence the node, as is the case with PolyServe Matrix Server. It turns out that the criteria used by Oracle Clusterware to trigger fencing are the same criteria that host clusterware uses to take action. Oracle instituted the Vendor Clusterware Validation suites to ensure that underlying clusterware is compatible with, and complements, Oracle Clusterware. STONITH is one form of fencing, but far from the only one. PolyServe supports a sophisticated form of STONITH where the healthy nodes integrate with management interfaces such as Hewlett-Packard’s iLO (Integrated Lights-Out) and Dell DRAC. Here again, the most important principle of clustering is implemented (healthy nodes take action to fence unhealthy nodes), which ensures that the fencing will occur. This form of STONITH is more sophisticated than the network power-switch approach, but in the end they do the same thing: both approaches power-cycle unhealthy nodes. However, it is not always desirable to have an unhealthy server power-cycled just for the sake of fencing.

Fabric Fencing
With STONITH, there could be helpful state information lost in the power reset. Losing that information may make cluster troubleshooting quite difficult. Also, if the condition that triggered the fencing persists across reboots, a “reboot loop” can occur. For this reason, PolyServe implements Fabric Fencing as the preferred option for customers running Real Application Clusters. Fabric Fencing is implemented in the PolyServe SAN management layer. PolyServe certifies a comprehensive list of Fibre Channel switches that are tested with the Fabric Fencing code. All nodes in a PolyServe cluster have LAN connectivity to the Fibre Channel switches. With Fabric Fencing, healthy nodes make SNMP calls to the Fibre Channel switch to disable all SAN access from unhealthy nodes. This form of fencing is built upon the sound principle of having healthy servers fence unhealthy servers, but the fenced server is left in an “up” state, yet completely severed from shared disk access. Administrators can log into it, view logs and so on, but before the node can rejoin the cluster, it must be rebooted.
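The life cycle of a fabric-fenced node can be modeled with a small Python sketch. Nothing here is PolyServe code. FakeSwitch is a made-up stand-in for a switch that would really be driven over SNMP, the Cluster class and its method names are hypothetical, and re-enabling the port at rejoin time is my assumption about how access would be restored.

```python
class FakeSwitch:
    """Stand-in for a Fibre Channel switch; a real one would be driven over SNMP."""

    def __init__(self, port_of):
        self.port_of = dict(port_of)  # node name -> switch port
        self.enabled = {port: True for port in self.port_of.values()}

    def disable_port(self, port):
        self.enabled[port] = False

    def enable_port(self, port):
        self.enabled[port] = True


class Cluster:
    def __init__(self, switch):
        self.switch = switch
        self.members = set(switch.port_of)
        self.fenced = set()

    def fabric_fence(self, node):
        """Cut the node's SAN access; the node itself stays powered on."""
        self.switch.disable_port(self.switch.port_of[node])
        self.members.discard(node)
        self.fenced.add(node)

    def has_san_access(self, node):
        return self.switch.enabled[self.switch.port_of[node]]

    def rejoin(self, node, rebooted):
        """A fenced node may only come back after a reboot."""
        if node in self.fenced and not rebooted:
            return False
        self.switch.enable_port(self.switch.port_of[node])  # assumption, see above
        self.fenced.discard(node)
        self.members.add(node)
        return True
```

The fenced node never appears in the picture as an actor. Its ports are shut by someone else, and it keeps running with its logs intact until an administrator is done with it and reboots it.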

Kernel Mode Clusterware
The most important aspect of host clusterware, such as PolyServe, is that it is generally implemented in Kernel Mode. In the case of PolyServe, the most critical functionality of SAN management, cluster filesystem, volume manager and so on is implemented in Kernel Mode on both Linux and Windows. On the other hand, when fencing code is implemented in User Mode, there is always the risk that the code will not get processor cycles to execute. Indeed, with clusters in general, overly saturated nodes often need to be fenced because they are not responding to status requests from other nodes in the cluster. When nodes in the cluster are getting so saturated as to trigger fencing, having critical clusterware code execute in Kernel Mode provides a higher level of assurance that the fencing operation will succeed. That is, if all the nodes in the cluster are approaching a critical state and a fencing operation is necessary against an errant node, having Kernel Mode fencing architected as either robust STONITH or Fabric Fencing ensures the correct action will take place.

Coming Soon
What about SCSI-3 Persistent Reservation? Isn’t I/O fencing as good as server fencing? No, it isn’t.


I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.
