Real Application Clusters | Kevin Closson's Blog: Platforms, Databases and Storage

In Alejandro Vargas’ blog entry about RAC & ASM, Crash and Recovery Test Scenarios, some tests were described that would cause RAC failovers. Unfortunately, none of the faults described were the of the sort that put clusterware to the test. The easiest types of failures for clusterware to handle are complete, clean outages. Simply powering of a server, for instance, is no challenge for any clusterware to deal with. The other nodes in the cluster will be well aware that the node is dead. The difficult scenarios for clusterware to respond to are states of flux and compromised participation in the cluster. That is, a server that is alive but not participating. The topic of Alejandro’s blog entry was not a definition of a production readiness testing plan by any means, but it was a good segway into the comment I entered:

These are good tests, yes, but they do not truly replicate difficult scenarios for clusterware to resolve. It is always important to perform manual fault-injection testing such as physically severing storage and network connectivity paths and doing so with simultaneous failures and cascading failures alike. Also, another good test to [run] is a forced processor starvation situation by forking processes in a loop until there are no [process] slots [remaining]. These […] situations are a challenge to any clusterware offering.

Clusterware is Serious Business
As I pointed out in my previous blog entry about Oracle Clusterware, processor saturation is a bad thing for Oracle Clusterware—particularly where fencing is concerned. Alejandro had this to say:

These scenarios were defined to train a group of DBA’s to perform recovery, rather than to test the clusterware itself. When we introduced RAC & ASM we did run stress & resilience tests. The starvation test you suggest is a good one, I have seen that happening at customer sites on production environments. Thanks for your comments.

Be Mean!
If you are involved with a pre-production testing effort involving clustered Oracle, remember, be evil! Don’t force failover by doing operational things like shutting down a server or killing Oracle clusterware processes. You are just doing a functional test when you do that. Instead, create significant server load with synthetic tests such as wild loops of dd(1) to /dev/null using absurdly large values assigned to the ibs argument or shell scripts that fork children but don’t wait for them. Run C programs that wildly malloc(2) memory, or maybe a little stack recursion is your flavor—force the system into swapping, etc. Generate these loads on the server you are about to isolate from the network for instance. See what the state of the cluster is afterwards. Of course, you can purposefully execute poorly tuned Parallel Query workloads to swamp a system as well. Be creative.

Something To Think About
For once, it will pay off to be evil. Just make sure whatever you accept as your synthetic load generator is consistent and reproducible because once you start this testing, you’ll be doing it again and again—if you find bugs. You’ll be spending a lot of time on the phone making new friends.

Introducing the Oracle SMP Expert. What is a Spinlock?
I am not joking when I tell you that I met an individual last year that billed himself as an “Oracle SMP expert.” That is fine and dandy, but through the course of our discussion I realized that this person had a severely limited understanding of the most crucial concept in SMP software scalability—critical sections. It wasn’t necessarily the concept of critical sections this individual didn’t really understand, it was the mutual exclusion that must accompany critical sections on SMP systems. In Oracle terms, this person could not deliver a coherent definition for what a latch is—that is, he didn’t understand what a spinlock was and how Oracle implements them. An “Oracle SMP expert” that lacks even cursory understanding of mutual exclusion principles is an awful lot like a “RAC expert” that does not have a firm understanding of what the term “fencing” means.

I have met a lot of “RAC experts” in the last 5 years who lack understanding of clusters principles—most notably what the term “fencing” is and what it means to RAC. Fencing is to clusters what critical sections are to SMP scalability.

Is it possible to be a “RAC expert” without being a cluster expert? The following is a digest of this paper about clusterware I have posted on the Oaktable Network website. For that matter, Julian Dyke and Steve Shaw accepted some of this information for inclusion in this RAC book.

Actually, I think getting it in their book was a part of the bribe for the technical review I did of the book (just joking).

I Adore RAC and Fencing is a Cool Sport!
No, not that kind of fencing. Fencing is a generic clustering term relating to how a cluster handles nodes that should no longer have access to shared resources such as shared disk. For example, if a node in the cluster has access to shared disk but has no functioning interconnects; it really no longer belongs in the cluster. There are several different types of fencing. The most common type came from academia and is referred to by the acronym STOMITH which stands for Shoot The Other Machine In The Head. A more popular variant of this acronym is STONITH where “N” stands for Node. While STONITH is a common term, there is nothing common with how it is implemented. The general idea is that the healthy nodes in the cluster are responsible for determining that an unhealthy node should no longer be in the cluster. Once such a determination is made, a healthy node takes action to power cycle the errant node. This can be done with network power switches for example. All told, STONITH is a “good” approach to fencing because it is generally built upon the notion that healthy nodes monitor and take action to fence unhealthy nodes.

This differs significantly from the “fencing” model implemented in Oracle Clusterware, which doesn’t implement STONITH at all. In Oracle Clusterware, nodes fence themselves by executing the reboot(8) command out of the /etc/init.d/init.cssd. This is a very portable approach to “fencing”, but it raises the question of what happens if the node is so unhealthy that it cannot successfully execute the reboot(8) command. Certainly we’ve all experienced systems that were so incapacitated that commands no longer executed (e.g., complete virtual memory depletion, etc.). In a cluster it is imperative that nodes be fenced when needed, otherwise they can corrupt data. After all, there is a reason the node is being fenced. Having a node with active I/O paths to shared storage after it is supposed to be fenced from the cluster is not a good thing.

Oracle Clusterware and Vendor Clusterware in Parallel
On all platforms, except Linux and Windows, Oracle Clusterware can execute in an integrated fashion with the host clusterware. An example of this would be Oracle10g using the libskgx[n/p] libraries supplied by HP for the MC ServiceGuard environment. When Oracle runs with integrated vendor clusterware, Oracle makes calls to the vendor-supplied library to perform fencing operations. This blog post is about Linux, so the only relationship between vendor clusterware and Oracle clusterware is when Oracle-validated compatible clusterware runs in parallel with Oracle Clusterware. One such example of this model is Oracle10g RAC on PolyServe Matrix Server.

In situations where Oracle’s fencing mechanism is not able to perform its fencing operation, the underlying validated host clusterware will fence the node, as is the case with PolyServe Matrix Server. It turns out that the criteria used by Oracle Clusterware to trigger fencing are the same criteria that host clusterware uses to take action. Oracle instituted the Vendor Clusterware Validation suites to ensure that underlying clusterware is compatible and complements Oracle clusterware. STONITH is one form of fencing, but far from the only one. PolyServe supports a sophisticated form of STONITH where the healthy nodes integrate with management interfaces such as Hewlett-Packard’s iLO (Integrated Lights-Out) and Dell DRAC. Here again, the most important principle of clustering is implemented—healthy nodes take action to fence unhealthy nodes— which ensures that the fencing will occur. This form of STONITH is more sophisticated than the network power-switch approach, but in the end they do the same thing—both approaches power-cycle unhealthy nodes. However, it is not always desirable to have an unhealthy server power-cycled just for the sake of fencing.

Fabric Fencing
With STONITH, there could be helpful state information lost in the power reset. Losing that information may make cluster troubleshooting quite difficult. Also, if the condition that triggered the fencing persists across reboots, a “reboot loop” can occur. For this reason, PolyServe implements Fabric Fencing as the preferred option for customers running Real Application Clusters. Fabric Fencing is implemented in the PolyServe SAN management layer. PolyServe certifies a comprehensive list of Fiber Channel switches that are tested with the Fabric Fencing code. All nodes in a PolyServe cluster have LAN connectivity to the Fiber Channel switches. With Fabric Fencing, healthy nodes make SNMP calls to the Fiber Channel switch to disable all SAN access from unhealthy nodes. This form of fencing is built upon the sound principle of having healthy servers fence unhealthy servers, but the fenced server is left in an “up” state—yet completely severed from shared disk access. Administrators can log into it, view logs and so on, but before the node can rejoin the cluster, it must be rebooted.

Kernel Mode Clusterware
The most important aspect of host clusterware, such as PolyServe, is that it is generally implemented in Kernel Mode. In the case of PolyServe, the most critical functionality of SAN management, cluster filesystem, volume manager and so on are implemented in Kernel Mode on both Linux and Windows. On the other hand, when fencing code is implemented in User Mode, there is always the risk that the code will not get processor cycles to execute. Indeed, with clusters in general, overly saturated nodes often need to be fenced because they are not responding to status requests by other nodes in the cluster. When nodes in the cluster are getting so saturated as to trigger fencing, having critical clusterware code execute in Kernel Mode is a higher level of assurance that the fencing operation will succeed. That is, if all the nodes in the cluster are approaching a critical state and a fencing operation is necessary against an errant node, having Kernel Mode fencing architected as either robust STONITH or Fabric Fencing ensures the correct action will take place.

Coming Soon
What about SCSI-III Persistent Reservation. Isn’t I/O fencing as good as server fencing? No, it isn’t.

	kevinclosson on Announcing SLOB 2.5.4
	Hell Dip on Announcing SLOB 2.5.4
	kevinclosson on Introducing SLOB – The S…
	Amey Bobade on Introducing SLOB – The S…
	Amey Bobade on Introducing SLOB – The S…

Kevin Closson's Blog: Platforms, Databases and Storage

Archive for the 'Real Application Clusters' Category

Testing RAC Failover: Be Evil, Make New Friends.

RAC Expert or Clusters Expert?

DISCLAIMER

Pages

Blogroll

Follow Blog via Email

Recent Posts

Recent Comments

Fond Memories

Copyright