BLOG UPDATE 2011.08.11 : For years my criticism of Oracle Clusterware fencing methodology brought ire from many who were convinced I was merely a renegade. The ranks of “the many” in this case were generally well-intended but overly convinced that Oracle was the only proven clustering technology in existence. It took Oracle many years, but they did finally offer support for IPMI fencing integration in the 11.2 release of Oracle Database. It also took me a long time to get around to updating this post. Whether by graces of capitulation or a reinvention of the wheel, you too can now, finally, enjoy a proper fencing infrastructure. For more information please see: http://download.oracle.com/docs/cd/E11882_01/rac.112/e16794/admin.htm#CHDGIAAD
I’ve covered the clusters concept of fencing quite a bit on this blog (e.g., RAC Expert or Clusters Expert and Now is the Time to Open Source, etc.), in papers such as this paper about clusterware, and in an appendix in the Julian Dyke/Steve Shaw book about RAC on Linux. If I’ve said it once, I’ve said it 1000 times: if you are not a clusters expert, you cannot be a RAC expert. Oddly though, Oracle seems to be sending a message that clusterware is commoditized—and it really isn’t. On the other hand, Oracle was brilliant for heading down the road of providing their own clusterware. Until all the kinks are worked out, it is good to know as much as you can about what is under the covers.
Linux RAC “Fencing”
As I’ve pointed out in the above referenced pieces, Oracle “fencing” is not implemented by healthy servers taking action against rogue servers (e.g., STONITH); instead, the server that needs to be “fenced” is sent a message, and upon receiving that message the sick server reboots itself. Of course, a sick server might not be able to reboot itself. I call this form of fencing ATONTRI (Ask The Other Node To Reboot Itself). This blog entry is not intended to bash Oracle clusterware “fencing”—it is what it is and it works well, and for those who choose, there is the option of running integrated legacy clusterware or validated third party clusterware to fill in the gaps. Instead, I want to blog about a couple of interesting observations and then cover some changes that were implemented in the Oracle init.cssd script under 10.2.0.3 that you need to be aware of.
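For contrast, here is roughly what a STONITH-style fence looks like when driven through IPMI, the mechanism Oracle eventually integrated in 11.2 (see the update at the top of this post). This is only an illustrative sketch; the BMC address and credentials are placeholders, and the point is simply that a healthy node performs the power action against the sick one:

# A healthy node forcibly powers off the rogue node via its BMC (address and credentials are placeholders)
ipmitool -I lanplus -H 192.168.10.42 -U admin -P secret chassis power off
# Confirm the victim really is down before the cluster carries on
ipmitool -I lanplus -H 192.168.10.42 -U admin -P secret chassis power status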
Logging When Oracle “Fences” a Server
As I mentioned in this blog entry about the 10.2.0.3 CRS patchset, I found 10.2.0.1 CRS—or is that “clusterware”—to be sufficiently stable to just skip over 10.2.0.2. So what I’m about to point out might be old news to you folks. The logging text produced by Oracle clusterware changed between 10.2.0.1 and 10.2.0.3. But, since CRS has a fundamental flaw in the way it logs this text, you’d likely never know it.
Lots of Looking Going On
As an aside, one of the cool things about blogging is that I get to track the search terms folks use to get here. Since the launch of my blog, I’ve had over 11000 visits from readers looking for information about the most common error message returned if you have a botched CRS install on Linux—that text being:
PROT-1: Failed to initialize ocrconfig
No News Must Be Good News
I haven’t yet blogged about the /var/log/messages entry you are supposed to see when Oracle fences a server, but if I had, I don’t think it would be a very common google search string anyway. No, the reason isn’t that Oracle so seldom needs to fence a server. The reason is that the text almost never makes it into the system log. Let’s dig into this topic.
The portion of the init.cssd script that acts as the “fencing” agent in 10.2.0.1 is coded to produce the following entry in the /var/log/messages file via the Linux logger(1) command (line numbers precede code):
194 LOGGER="/usr/bin/logger"
[snip]
1039 *)
1040 $LOGERR "Oracle CSSD failure. Rebooting for cluster integrity."
1041
1042 # We want to reboot here as fast as possible. It is imperative
1043 # that we do not flush any IO to the shared disks. Choosing not
1044 # to flush local disks or kill off processes gracefully shuts
1045 # us down quickly.
[snip]
1081 $EVAL $REBOOT_CMD
Let’s think about this for a moment. If Oracle needs to “fence” a server, the server that is being fenced should produce the following text in /var/log/messages:
Oracle CSSD failure. Rebooting for cluster integrity.
Where’s Waldo?
Why is it that when I google for “Oracle CSSD failure. Rebooting for cluster integrity” I get 3, count them, 3 articles returned? Maybe the logger(1) command simply doesn’t work? Let’s give that a quick test:
[root@tmr6s14 log]# logger “I seem to be able to get messages to the log”
[root@tmr6s14 log]# tail -1 /var/log/messages
Jan 9 15:16:33 tmr6s14 root: I seem to be able to get messages to the log
[root@tmr6s14 log]# uname -a
Linux tmr6s14 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
Interesting. Why don’t we see the string Oracle CSSD failure when Oracle fences then? It’s because the logger(1) command merely sends a message to syslogd(8) via a socket—and then it is off to the races. Again, back to the 10.2.0.1 init.cssd script:
22 # FAST_REBOOT - take out the machine now. We are concerned about
23 # data integrity since the other node has evicted us.
[…] lines deleted
177 case $PLATFORM in
178 Linux) LD_LIBRARY_PATH=$ORA_CRS_HOME/lib
179 export LD_LIBRARY_PATH
180 FAST_REBOOT="/sbin/reboot -n -f"
So at line 1040, the script sends a message to syslogd(8) and then immediately forces a reboot at line 1081—with the -n option to the reboot(8) command forcing a shutdown without sync(1). So there you have it: the text is left in flight between the bash(1) process executing the init.cssd script and the syslogd(8) process, which would do a buffered write anyway. I think the planets must really be aligned for this text to ever get to the /var/log/messages file—and I think the google search for that particular string goes a long way towards backing up that notion. When I really want to see this string pop up in /var/log/messages, I fiddle with putting sync(1) commands and a sleep before the line 1081. That is what I do when, for instance, I am pulling physical connections from the Fibre Channel SAN paths and studying how Oracle behaves by default.
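For the curious, my fiddled-with version of that section looks roughly like the following. This is strictly a test-rig hack of mine, not anything Oracle ships:

$LOGERR "Oracle CSSD failure. Rebooting for cluster integrity."
sync                 # flush local filesystem buffers so syslogd's write can reach disk
sync
sleep 2              # give syslogd a moment to service the socket and log the message
$EVAL $REBOOT_CMD    # /sbin/reboot -n -f on Linux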
By the way, the comments at lines 22-23 are the definition of ATONTRI.
Paranoia?
I’ve never understood the paranoia at lines 1042-1043, which state:
We want to reboot here as fast as possible. It is imperative that we do not flush any IO to the shared disks.
It may sound a bit nit-picky, but folks, this is RAC and there are no buffered writes to shared disk! No matter really; even if there were a sync(1) command at line 1080 in the 10.2.0.1 init.cssd script, the likelihood of getting the text to /var/log/messages is still going to be a race, as I’ve pointed out.
Differences in 10.2.0.3
Google searches for fencing articles anchored with the Oracle CSSD failure string are about to get even more scarce. In 10.2.0.3, the text that the script attempts to send to the /var/log/messages file changed—the string no longer contains CSSD, but CRS instead. The following is a snippet from the init.cssd script shipped with 10.2.0.3:
452 *)
453 $LOGERR "Oracle CRS failure. Rebooting for cluster integrity."
A Workaround for a Red Hat 3 Problem in 10.2.0.3 CRS
OK, this is interesting. In the 10.2.0.3 init.cssd script, there is a workaround for some RHEL 3 race condition. I would be more specific about this, but I really don’t care about any problems init.cssd has in its attempt to perform fencing since, for me, the whole issue is moot. PolyServe is running underneath it and PolyServe is not going to fail a fencing operation. Nonetheless, if you are not on RHEL 3 and you deploy bare-bones Oracle-only RAC (e.g., no third party clusterware for fencing), you might take interest in this workaround since it could cause a failed fencing. That’s split-brain to you and me.
Just before the actual execution of the reboot(8) command, every Linux system running 10.2.0.3 will now suffer the overhead of the code starting at line 489 shown in the snippet below. The builtin test of the variable $PLATFORM is pretty much free, but if for any reason you are on RHEL 4, Novell SuSE SLES9 or even Oracle Enterprise Linux (who knows how they attribute versions to that), the code at line 491 is unnecessary and could put a full stop to the execution of this script if the server is in deep trouble—and remember, fencings are supposed to handle deeply troubled servers.
Fiddle First, Fence Later
Yes, the test at line 491 is a shell builtin, no argument there, but as line 226 shows, the shell command at line 491 is checking for the existence of the file /var/tmp/.orarblock. I haven’t looked, but bash(1) most likely implements test -e with a stat(2) call on the file, returning true if the call succeeds and false if it does not. In the end, however, if checking for the existence of a file in /var/tmp is proving difficult at the time init.cssd is trying to “fence” a server, this code is pretty dangerous since it can cause a failed fencing on a Linux RAC deployment. Further, at line 494 the script will need to open a file and write to it. All this on a server that is presumed sick and needs to get out of the cluster. Then again, who is to say that the bash process executing the init.cssd script is not permanently swapped out due to extreme low-memory thrashing? Remember, servers being told to fence themselves (ATONTRI) are not healthy. Anyway, here is the relevant snippet of 10.2.0.3 init.cssd:
226 REBOOTLOCKFILE=/var/tmp/.orarblock
[snip]
484 # Workaround to Redhat 3 issue with multiple invocations of reboot.
485 # Here if oclsomon and ocssd are attempting a reboot at the same time
486 # then the kernel could lock up. Here we have a crude lock which
487 # doesn't eliminate but drastically reduces the likelihood of getting
488 # two reboots at once.
489 if [ "$PLATFORM" = "Linux" ]; then
490 CEDETO=
491 if [ -e "$REBOOTLOCKFILE" ]; then
492 CEDETO=`$CAT $REBOOTLOCKFILE`
493 fi
494 $ECHO $$ > $REBOOTLOCKFILE
495
496 if [ ! -z "$CEDETO" ]; then
497 REBOOT_CMD="$SLEEP 0"
498 $LOGMSG "Oracle init script ceding reboot to sibling $CEDETO."
499 fi
500 fi
501
502 $EVAL $REBOOT_CMD
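If you want to see why the comment at line 486 only claims to drastically reduce, rather than eliminate, the chance of two simultaneous reboots, the check-then-write logic is easy to exercise outside of init.cssd. The following is a minimal standalone sketch of my own (the lock file path and the messages are made up for the demo; none of this is Oracle code). Because the existence check and the write are separate, non-atomic steps, two racing copies can both find no lock and both decide to reboot:

#!/bin/bash
# Standalone demo of the crude lock above. Two copies race: each checks for
# the lock file, then writes its own identity. The check and the write are
# not atomic, so both copies can see "no lock" and both can decide to reboot.
LOCK=/tmp/.orarblock.demo
rm -f "$LOCK"

cede_or_reboot() {
    local me="$1" cedeto=""
    [ -e "$LOCK" ] && cedeto=$(cat "$LOCK")
    echo "$me" > "$LOCK"                 # init.cssd writes $$ here
    if [ -n "$cedeto" ]; then
        echo "$me: ceding reboot to sibling $cedeto"
    else
        echo "$me: would run /sbin/reboot -n -f now"
    fi
}

cede_or_reboot oclsomon &
cede_or_reboot ocssd &
wait
rm -f "$LOCK"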
Any comments/blogs on Sun’s cluster announcement:
http://www.sun.com/2007-0109/feature/index.jsp?intcmp=hp2007jan09_cluster_read
In particular QFS and fencing.
Thanks,
-Bob
Sun Cluster fencing is based on SCSI reservations, similar to Veritas. Veritas uses SCSI-3 persistent reservations, while Sun Cluster seems to work with SCSI-2 disks as well.
Here is a bit of detail – http://docs.sun.com/app/docs/doc/819-0421/6n2rmm7rn?a=view#caccajda
I will leave it to Kevin to point out the difference between fencing using SCSI reservations and the kind PolyServe does, which is fencing at the FC HBAs. 😉
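For readers unfamiliar with the mechanism, a rough command-line sketch of SCSI-3 persistent-reservation fencing follows, using sg_persist from the sg3_utils package. The device name and reservation keys are placeholders, and this only illustrates the general idea rather than Sun Cluster’s or Veritas’s actual implementation:

# Each node registers a key with the shared LUN; one node takes the reservation
sg_persist --out --register --param-sark=0xA /dev/sdb
sg_persist --out --reserve --param-rk=0xA --prout-type=5 /dev/sdb    # type 5 = Write Exclusive, Registrants Only
sg_persist --in --read-keys /dev/sdb                                 # list the registered keys
# To fence a rogue node, a healthy node preempts the victim's key, cutting off its writes
sg_persist --out --preempt --param-rk=0xA --param-sark=0xB --prout-type=5 /dev/sdb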
Kevin,
About the question why the message doesn’t appear in /var/log/messages:
I am not sure whether the text that you quote from the init.cssd is actually copied, or you made a typo…
*** Quote:
194 LOGGER="/usr/bin/logger"
[snip]
1039 *)
1040 $LOGERR "Oracle CSSD failure. Rebooting for cluster integrity."
L $REBOOT_CMD
*** end Quote
Do you see the difference? The logger command is stored in the variable $LOGGER (at line 194), but the variable used to initiate the logging (line 1040) is typed as $LOGERR (instead of $LOGGER). That could be the reason why you don’t see the message in /var/log/messages.
If this is actually the case (I don’t have my 10.2.0.3 Clusterware environment with me here), this whole thing could just be the result of a (stupid?) typo.
On the other hand, if the typo is yours, this whole thing doesn’t apply. BTW your second quote from the 10.2.0.3 init.cssd also contains this typo.
Just something I noticed.
Arnoud Roth
AMIS Services
Kevin,
Thanks for this blog. I’ve always been aware that fencing is an essential component of supportable clustering and I know you have strong feelings on the subject, but I enjoyed the technical justification and clarity here.
When people ask me why this is so important, I’ll point them at this blog 😉
I am sorry, the quote came out a bit erroneous. It should have been:
*** Quote:
194 LOGGER="/usr/bin/logger"
[snip]
1039 *)
1040 $LOGERR "Oracle CSSD failure. Rebooting for cluster integrity."
…
[snip]
1081 $EVAL $REBOOT_CMD
*** end Quote
Regards,
Arnoud Roth
AMIS Services
Again, it is me. I took a look at my own version of init.cssd, and found out that both LOGGER and LOGERR are defined to /usr/bin/logger or, if the file doesn’t exist there, /bin/logger.
So, no typos, just confusion. But, hey, Oracle is known to be good at that!
Surprising though, that what you found is the case. I would assume (hehe!) Oracle would have tested stuff like this…
(As system administrator, I would consider it rather important to be able to track down why the system suddenly rebooted).
How naive can one be.
Regards,
Arnoud Roth
Arnoud,
LOGERR uses $LOGGER…
Thanks for visiting!
Doug,
Nah, you just like me because we’ve beered! 🙂
Hi Bob,
RE the Jan 9 Sun Cluster announcement, yes I have comments. I will blog about it. In short, I think it is just great that real platform providers are being allowed to help out Oracle because without such platforms RAC is a mess. I’ll have to dig in to see what the architecture is like, but if I recall, you need to implement a “third leg” quorum server to help vote out split brains with QFS, so you know I’m not going to like that; I am only a fan of fully distributed and symmetric technology.
Kevin, Oracle 10.2 is a revolutionary release for many reasons. First, Oracle started hiding SQL statements issued by the background processes. I am able to see them, but I have to attach with oradebug and turn on event 10046. Second, Oracle is hiding internal details. Cluster configuration is less documented than ever. Oracle documentation detailing cache fusion, the crucial difference between RAC and OPS, is a joke. If anybody who has read it can honestly tell
that he understands cache fusion, I’ll eat my Blackberry with mayonnaise and ketchup. Code quality went from “bad” to “deplorable” and then to “loathsome”. Everything is full of patchworked Java Swing interfaces and XML files, LMD processes suddenly start spinning and consuming CPU like crazy, processes wait on global cache latches and on undocumented events with totally incomprehensible arguments that do not help you in any way, shape or form, and 3rd party vendors are being desupported at an alarming rate. I believe that Oracle has become Microsoft. I definitely do not like the new behaviour of the company.
Mladen, don’t forget some mustard on your Blackberry, they go down much better that way. At least the newer models.
Kevin, thanks for the great article. I’ve always had a problem with the way CRS/Cluster Manager/Clusterware does its logging, as well as the lack of documentation on the subject. Your dissection of init.cssd is excellent and lives up to my note in Log Buffer last week regarding your “uncanny ability to read between the lines.”
…kind words Steve, thanks. We’ll be reading between some real fun lines soon since I’m about to start in on NUMA…thanks for stopping by
Kevin,
I always knew the way Oracle clusterware worked was bad. Thanks for reassuring me in such great detail. It almost felt as if I were right next to you as you showed the code.
Thinking about it, it shouldn’t be too hard to simulate a data corruption case, knowing how poorly this is implemented?
Christo,
Thanks for stopping by. I’ll just say that a split-brain is one of the best ways to get corruption. I’m only trying to get the point across about how crucial fencing is and why we all need to take it so seriously.
Nothing makes a point more clear than a test case.
“split-brain” just doesn’t sound menacing enough.
Simulating this without pulling cables, just by overloading a server and causing any type of corruption, would definitely send a clear message.
Christo,
I have no message to send. CRS is not responsible for fencing when running on validated third party clusterware (RAC Technology Certification Matrix). I don’t know whether CRS fencing works because the node is going to get fenced by PolyServe no matter what.
Hi Kevin
Great article and a great blog for a clustering newbie like me.
I have a question about Split Brain and how Oracle Clusterware handles it, regardless of the platform, say Solaris, HP-UX, Linux, whatever.
In a two node RAC, when I pull the interconnect cable from node 2, after a while (30 seconds, the CSS misscount) the node is rebooted, writing these messages:
[ CSSD]2007-03-27 08:50:44.810 [8] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(45984) LATS(46609011) Disk lastSeqNo(45984)
[ CSSD]2007-03-27 08:50:44.870 [20] >ERROR: clssnmCheckDskInfo: Terminating local instance to avoid splitbrain.
[ CSSD]2007-03-27 08:50:44.870 [20] >ERROR: : Node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(1)
[ CSSD]2007-03-27 08:50:44.870 [20] >TRACE: clssscctx: dump of 0x520740, len 3808
Whereas node 1, which survives, writes:
[ CSSD]2007-03-27 08:50:44.200 [8] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(2) wrtcnt(45989) LATS(46597381) Disk lastSeqNo(45989)
[ CSSD]2007-03-27 08:50:44.900 [20] >TRACE: clssnmCheckDskInfo: node(2) missCount(34) state(0). Smaller(1) cluster node 2. mine is 1. (2/1)
[ CSSD]2007-03-27 08:50:44.900 [20] >TRACE: clssnmEvict: Start
[ CSSD]2007-03-27 08:50:44.900 [20] >TRACE: clssnmEvict: Evicting node 2, birth 1, death 2, killme 1
………….
[ CSSD]2007-03-27 08:51:00.020 [13] >TRACE: clssnmUpdateNodeState: node 1, state (3/3) unique (1174931803/1174931803) prevConuni(0) birth (1/1) (old/new)
[ CSSD]2007-03-27 08:51:00.020 [13] >TRACE: clssnmUpdateNodeState: node 2, state (0/0) unique (1174931798/1174931798) prevConuni(1174931798) birth (1/0) (old/new)
[ CSSD]2007-03-27 08:51:00.020 [13] >TRACE: clssnmDeactivateNode: node 2 (oradb02) left cluster
I understand that the voting disk is used to determine split brain in CRS. What I don’t really get is: how does node 1 or node 2 know that node 2 is the one that lost the private interconnect and is therefore the one evicted?
The blog entry is about Linux; what do you think about oprocd fencing on other Unix flavors such as Solaris?
David
Just dropped by to tell you about an incident which happened at one of my sites where the OCR got corrupted thanks to 2 clusters writing to the same OCR. Oracle does not allow 2 clusters to write to the same database (we get an ORA-600 [kccsbck_first]), but it allows 2 clusters to write to the same OCR.
So what happened was that the first node started with the Veritas libskgxn and all other nodes started with the Oracle libskgxn. We had 2 OCR masters writing to the same OCR.
Have you seen this anywhere?
Fairlie,
I have not seen that sort of problem, but it seems quite plausible. I’m sure skgxn plays a huge part in ensuring the correctness of the OCR. So, having essentially two domains of skgxn (one with Veritas SFRAC and the other with Oracle Clusterware) would leave room for problems.
Hello,
I am pretty much new to this topic.
We have a Sun Cluster with Oracle RAC installed on it.
I am performing the following scenario:
node 1 -> init 0
node 2 -> reboot (while node 1 is still down)
After this, node 2 comes up nicely.
When I bring up node 1 again, node 2 reboots.
I just want to know whether this is the normal scenario
or if something is going wrong somewhere.