Oracle Clusterware and Fencing…Again?

Coming back from vacation and failing to catch up on oracle-l list topics is a bad mistake. I was wondering why Kirk McGowan decided to make a post about fencing in the context of Oracle Clusterware. After finally catching up on my oracle-l backlog, I see that the stimulus for Kirk’s blog entry was likely this post to the oracle-l list where the list member was asking whether Oracle Clusterware implements STONITH as its fencing model. It seems the question was asked after the list member watched this Oracle webcast about RAC where slide 11 specifically states:

IO Fencing via Stonith algorithm (remote power reset)

The list member was conflicted over the statement in Oracle’s webcast. It seems he had likely seen my blog entry entitled RAC Expert or Clusters Expert where I discuss the clusters concept of fencing. In that blog entry, and in the paper I reference therein, I point out that Oracle Clusterware doesn’t implement STONITH because it doesn’t. Oh boy, there he goes again contradicting Oracle. Well, no, I’m not. The quote from Oracle’s webcast says they implement their fencing using a “STONITH algorithm” and they do. The bit about remote power reset is splitting hairs a bit since the way the fenced node excuses itself from the cluster is by executing an immediate shutdown (e.g., Linux reboot(8) command). Kirk correctly points out that the correct term is actually suicide. Oracle uses algorithms common to STONITH implementations to determine what nodes need to get fenced. When a node is alerted that it is being “fenced” it uses suicide to carry out the order.

What Time Is It?
From about 2003 through 2005 I had dozens of people ask me for in-depth clusters concepts information with both a generic view and an Oracle-centric view-I was working for a clustering company and had a long history of clustered Oracle behind me after all. It seems people were getting confused as to why there were options to use vendor-integrated host clusterware on all platforms except Linux and Windows. People wanted to better understand both generic clustering concepts as well as Oracle Clusterware. It seems some merely wanted to “know what time it is” while others wanted to “know how to tell time” and some even wanted to know “how the clock works.” About the time Oracle implemented the Third Party Clusterware Validation Program I decided I need to write a paper on the matter, so I did and posted it on the OakTable Network site. In the paper, and my blog post, I point out that Oracle Clusterware is not STONITH-clinically speaking-and indeed it isn’t. STONITH requires healthy servers to take action against ill servers via:

  • Remote Power Reset. This technology is not expensive, nor spooky. In fact, here is a network power switch for $199 that allows SNMP commands to power cycle outlets. Academic (and some commercial) approaches use these sorts of devices when implementing clusters. A healthy server will simply issue an SNMP command to power off the ill server. Incidentally, not all servers that run Oracle even have a power cord (think blades) and some don’t even use AC (see Rackable’s DB Power servers) so Oracle couldn’t use this approach without horrible platform-specific porting issues.
  • Remote System Management. There are a plethora of remote system management technologies (e.g., power cycle a server remotely) such as DRAC, IPMI, iLO, ALOM, RSC. Oracle is not crazy enough to tailor their fencing requirements around each of these. What a porting nightmare that would be. Oracle has stated more than once that there are no standards in this space and thus no useable APIs. The closest thing would have been either IPMI or OPMA, but the industry hasn’t seemed to want a cross-platform standard in this space.

The lack of standards where cluster fencing is concerned leaves us with a wide array of vendor clusterware such as Service Guard, HACMP, VCS, PolyServe, Red Hat Cluster Suite and on and on. I had a lot of Oracle customers asking me to inform them of the fundamental differences between these various clusterware and Oracle’s Clusterware so I did.

Gasp, Oracle Doesn’t Implement STONITH!
Henny penny: the sky is falling. So Oracle doesn’t really implement STONITH. So what. That doesn’t mean nobody wants to understand the general topic of clustering-and fencing in particular-a little bit better. It would not be right to tell them that their quest for information is moot just because other cluster approaches are not embodied in Oracle Clusterware. However, the importance of Oracle’s choice of fencing method is probably summed up the best in that oracle-l email thread which dried up and died within 24 hours after another member posted the following:

Has anyone see a RAC data corruption due to Clusterware unable to shoot itself?

I can assure you all that if anyone reading the oracle-l list had such a testimonial we would have heard it. The oracle-l list membership is huge and there are also a lot of consultants on the list who have contacts with a lot of production sites. The thread dried up, dropped to the ground and died. I think what I just wrote mirrors Kirk McGowan’s position on the matter.

What Would It Take
No clustering approach is perfect. Whether STONITH, fabric fencing or suicide, clusters can melt down. That is, after all, why Oracle offers an even higher level of protection in their Maximum Availability Architecture through such technology as DataGuard.

What would it take for an Oracle Clusterware fencing breach and why would I blog such taboo? It takes a lot of unlikely (yet possible) circumstances and because some people want to know. With Oracle Clusterware, a fencing breach would require:

  • Failed Suicide. If for any reason Oracle’s Clusterware process is not able to successfully execute a software reboot of the ill server.
  • Hangcheck Failure. The hangcheck kernel module executes off a kernel timer. If the system is so ill that these kernel events are not getting triggered then that would mean hardclock interrupts are not working and I should think the system would likely PANIC. All told a PANIC is just as good as hangcheck timer succeeding really. Nonetheless, it is possible that such a situation could arise.

A Waste of My Time
So over the last few years I spent a little time explaining clusters concepts to people with Oracle in mind. In my writings I discussed such topics as fencing, kernel mode/user mode clusterware, skgxp(), skgxn() and a host of other RAC-related material. Was it a waste of my time? No. Do I agree with Kirk McGowan’s post? Yes. Most importantly, however, I hang my hat on the oracle-l thread that dead-ended when the last poster on the thread asked:

Has anyone see a RAC data corruption due to Clusterware unable to shoot itself?

…and then there was silence.

4 Responses to “Oracle Clusterware and Fencing…Again?”

  1. 1 Dan Norris August 17, 2007 at 7:26 pm

    Thanks for rehashing and summarizing this topic, Kevin. As I mentioned to Kirk, I don’t think there are very many concise descriptions about Oracle Clusterware’s fencing mechanisms and the challenges surrounding fencing.

    I also wanted to point out that the webcast you referenced is one that was for the Oracle RAC SIG ( and there are many more recorded webcasts there available for on-demand playback. Membership is free–just sign up on the site.

    Thanks for the high-quality, informative posting, as usual!

    Dan Norris
    RAC SIG Events Chair

  2. 2 kevinclosson August 17, 2007 at 8:06 pm

    Thanks for stopping by, Dan!

  3. 3 lscheng August 17, 2007 at 11:08 pm

    I think we all noted Kirk said this:

    “Note that this discussion is focused solely on Oracle Clusterware used in conjunction with RAC.”

    So CRS is perfect for RAC so far, may be the kernel/user mode clusterware flaws applies for those who use CRS for other purposes…. Who knows 😉


    LiShan Cheng

  4. 4 Pedro Alvarez August 18, 2007 at 5:49 pm


    It was me who posted that. My manager wanted me to dig me out some information about the oracle clusterware. Our shop got 80 percent of unstructured data. We wanted to have a cluster file system and cluster apps (the way done on VCS), not to deal with raw disks or ASM. Therefore, we wanted to know how fencing is done in the oracle clusterware. And the oracle sales guy was not even responding at all.

    Given you blogged many times about fencing, I thought of posting Oracle presentation–esp about stonith. I would not have posted it on oracle-l, if they had a decent documentation

    Thanks for your blog, again.

Leave a Reply

Please log in using one of these methods to post your comment: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.


I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 744 other subscribers
Oracle ACE Program Status

Click It

website metrics

Fond Memories


All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.

%d bloggers like this: