BLOG UPDATE 2011.08.11 : For years my criticism of Oracle Clusterware fencing methodology brought ire from many who were convinced I was merely a renegade. The ranks of “the many” in this case were generally well-intended but overly convinced that Oracle was the only proven clustering technology in existence. It took many years for Oracle to do so, but they did finally offer support for IPMI fencing integration in the 11.2 release of Oracle Database. It also took me a long time to get around to updating this post. Whether by graces of capitulation or a reinvention of the wheel, you too can now, finally, enjoy a proper fencing infrastructure. For more information please see: http://download.oracle.com/docs/cd/E11882_01/rac.112/e16794/admin.htm#CHDGIAAD
I’ve covered the clusters concept of fencing quite a bit on this blog (e.g., RAC Expert or Clusters Expert and Now is the Time to Open Source, etc), and in papers such as this paper about clusterware, and in an appendix in the Julian Dyke/Steve Shaw book about RAC on Linux. If I’ve said it once, I’ve said it 1000 times; if you are not a clusters expert you cannot be a RAC expert. Oddly though, Oracle seems to be sending a message that clusterware is commoditized—and it really isn’t. On the other hand, Oracle was brilliant for heading down the road of providing their own clusterware. Until all the kinks are worked out, it is good to know as much as you can about what is under the covers.
Linux RAC “Fencing”
As I’ve pointed out in the above referenced pieces, Oracle “fencing” is not implemented by healthy servers taking action against rogue servers (e.g., STONITH), but instead the server that needs to be “fenced” is sent a message. With that message, the sick server will then reboot itself. Of course, a sick server might not be able to reboot itself. I call this form of fencing ATONTRI (Ask The Other Node To Reboot Itself).This blog entry is not intended to bash Oracle clusterware “fencing”—it is what it is, works well and for those who choose there is the option of running integrated Legacy clusterware or validated third party clusterware to fill in the gaps. Instead, I want to blog about a couple of interesting observations and then cover some changes that were implemented to the Oracle init.cssd script under 10.2.0.3 that you need to be aware of.
Logging When Oracle “Fences” a Server
As I mentioned in this blog entry about the 10.2.0.3 CRS patchset, I found 10.2.0.1 CRS—or is that “clusterware”—to be sufficiently stable to just skip over 10.2.0.2. So what I’m about to point out might be old news to you folks. The logging text produced by Oracle clusterware changed between 10.2.0.1 and 10.2.0.3. But, since CRS has a fundamental flaw in the way it logs this text, you’d likely never know it.
Lot’s of Looking Going On
As an aside, one of the cool things about bloggingis that I get to track the search terms folks use to get here. Since the launch of my blog, I’ve had over 11000 visits from readers looking for information about the most common error message returned if you have a botched CRS install on Linux—that text being:
PROT-1: Failed to initialize ocrconfig
No News Must Be Good News
I haven’t yet blogged about the /var/log/messages entry you are supposed to see when Oracle fences a server, but if I had, I don’t think it would be a very common google search string anyway? No the reason isn’t that Oracle so seldomly needs to fence a server. The reason is that the text generally (nearly never actually) doesn’t make it into the system log. Let’s dig into this topic.
The portion of the init.cssd script that acts as the “fencing” agent in 10.2.0.1 is coded to produce the following entry in the /var/log/messages file via the Linux logger(1) command (line numbers precede code):
1040 $LOGERR “Oracle CSSD failure. Rebooting for cluster integrity.”
1042 # We want to reboot here as fast as possible. It is imperative
1043 # that we do not flush any IO to the shared disks. Choosing not
1044 # to flush local disks or kill off processes gracefully shuts
1045 # us down quickly.
1081 $EVAL $REBOOT_CMD
Let’s think about this for a moment. If Oracle needs to “fence” a server, the server that is being fenced should produce the followingtext in /var/log/messages:
Oracle CSSD failure.Rebooting for cluster integrity.
Why is it when I google for “Oracle CSSD failure.Rebooting for cluster integrity” I get 3, count them, 3 articles returned? Maybe the logger(1) command simply doesn’t work? Let’s give that a quick test:
[root@tmr6s14 log]# logger “I seem to be able to get messages to the log”
[root@tmr6s14 log]# tail -1 /var/log/messages
Jan 9 15:16:33 tmr6s14 root: I seem to be able to get messages to the log
[root@tmr6s14 log]# uname -a
Linux tmr6s14 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
Interesting. Why don’t we see the string Oracle CSSD failure when Oracle fences then? It’s because the logger(1) command merely sends a message to syslogd(8) via a socket—and then it is off to the races. Again, back to the 10.2.0.1 init.cssd script:
22 # FAST_REBOOT – take out the machine now. We are concerned about
23 # data integrity since the other node has evicted us.
[…] lines deleted
177 case $PLATFORM in
178 Linux) LD_LIBRARY_PATH=$ORA_CRS_HOME/lib
179 export LD_LIBRARY_PATH
180 FAST_REBOOT=”/sbin/reboot -n -f”
So at line 1040, the script sends a message to syslogd(8) and then immediately forces a reboot at line 1081—with the –n option to the reboot(8) command forcing a shutdown without sync(1). So there you have it, the text is drifting between the bash(1) context executing the init.cssd script and the syslogd(8) process that would do a buffered write anyway. I think the planets must really be in line for this text to ever get to the /var/log/messages file—and I think the google search for that particular string goes a long way towards backing up that notion. When I really want to see this string pop up in /var/log/messages, I fiddle with putting sync(1) comands and sleep before the line 1081. That is when I am, for instance, pulling physical connections from the Fibre Channel SAN paths and studying what Oracle behaves like by default.
By the way, the comments at lines 22-23 are the definition of ATONTRI.
I’ve never understood that paranoia at lines 1042-1043 which state:
We want to reboot here as fast as possible. It is imperative that we do not flush any IO to the shared disks.
It may sound a bit nit-picky, but folks this is RAC and there are no buffered writes to shared disk! No matter really, even if there was a sync(1) command at line 1080 in the 10.2.0.1 init.cssd script, the likelihood of getting text to /var/log/messages is still going to be a race as I’ve pointed out.
Differences in 10.2.0.3
Google searches for fencing articles anchored with the Oracle CSSD failure string are about to get even more scarce. In 10.2.0.3, the text that the script attempts to send to the /var/log/messages file changed—the string no longer contains CSSD, but CRS instead. The following is a snippet from the init.cssd script shipped with 10.2.0.3:
453 $LOGERR “Oracle CRS failure. Rebooting for cluster integrity.”
A Workaround for a Red Hat 3 Problem in 10.2.0.3 CRS
OK, this is interesting. In the 10.2.0.3 init.cssd script, there is a workaround for some RHEL 3 race condition. I would be more specific about this, but I really don’t care about any problems init.cssd has in its attempt to perform fencing since for me the whole issue is moot. PolyServe is running underneath it and PolyServe is not going to fail a fencing operation. Nonetheless, if you are not on RHEL 3, and you deploy bare-bones Oracle-only RAC (e.g., no third party clusterware for fencing), you might take interest in this workaround since it could cause a failed fencing. That’s split-brain to you and I.
Just before the actual execution of the reboot(8) command, every Linux system running 10.2.0.3 will now suffer the overhead of the code starting at line 489 shown in the snippet below. The builtin test of the variable $PLATFORM is pretty much free, but if for any reason you are on a RHEL 4, Novell SuSE SLES9 or even Oracle Enterprise Linux (who knows how they attribute versions to that) the code at line 491 is unnecessary and could put a full stop to the execution of this script if the server is in deep trouble—and remember fencings are suppose to handle deeply troubled servers.
Fiddle First, Fence Later
Yes, the test at line 491 is a shell builtin, no argument, but as line 226 shows, the shell command at line 491 is checking for the existence of the file /var/tmp/.orarblock. I haven’t looked, but bash(1) is most likely calling open(1) with O_CREAT and O_EXCL and returning true on test –e if the open(1) call gets EEXIST returned and false if not. In the end, however, if checking for the existence for a file in /var/tmp is proving difficult at the time init.cssd is trying to “fence” a server, this code is pretty dangerous since it can cause a failed fencing on a Linux RAC deployment. Further, at line 494 the script will need to open a file and write to it. All this on a server that is presumed sick and needs to get out of the cluster. Then again, who is to say that the bash process executing the init.cssd script is not totally swapped out permanently due to extreme low memory thrashing? Remember, servers being told to fence themselves (ATONTRI) are not healthy. Anyway, here is the relevant snippet of 10.2.0.3 init.cssd:
484 # Workaround to Redhat 3 issue with multiple invocations of reboot.
485 # Here if oclsomon and ocssd are attempting a reboot at the same time
486 # then the kernel could lock up. Here we have a crude lock which
487 # doesn’t eliminate but drastically reduces the likelihood of getting
488 # two reboots at once.
489 if [ “$PLATFORM” = “Linux” ]; then
491 if [ -e “$REBOOTLOCKFILE” ]; then
492 CEDETO=`$CAT $REBOOTLOCKFILE`
494 $ECHO $$ > $REBOOTLOCKFILE
496 if [ ! -z “$CEDETO” ]; then
497 REBOOT_CMD=”$SLEEP 0″
498 $LOGMSG “Oracle init script ceding reboot to sibling $CEDETO.”
502 $EVAL $REBOOT_CMD