RAC Expert or Clusters Expert?

BLOG UPDATE 2011.08.11: For years my criticism of Oracle Clusterware fencing methodology brought ire from many who were convinced I was merely a renegade. The ranks of “the many” in this case were generally well-intentioned but overly convinced that Oracle was the only proven clustering technology in existence. It took Oracle many years, but they did finally offer support for IPMI fencing integration in the 11.2 release of Oracle Database. It also took me a long time to get around to updating this post. Whether by the grace of capitulation or a reinvention of the wheel, you too can now, finally, enjoy a proper fencing infrastructure. For more information please see: http://download.oracle.com/docs/cd/E11882_01/rac.112/e16794/admin.htm#CHDGIAAD

Introducing the Oracle SMP Expert. What is a Spinlock?
I am not joking when I tell you that I met an individual last year who billed himself as an “Oracle SMP expert.” That is fine and dandy, but through the course of our discussion I realized that this person had a severely limited understanding of the most crucial concept in SMP software scalability: critical sections. It wasn’t so much the concept of critical sections that this individual didn’t understand; it was the mutual exclusion that must accompany critical sections on SMP systems. In Oracle terms, this person could not deliver a coherent definition of what a latch is. That is, he didn’t understand what a spinlock is or how Oracle implements one. An “Oracle SMP expert” who lacks even a cursory understanding of mutual exclusion principles is an awful lot like a “RAC expert” who does not have a firm understanding of what the term “fencing” means.

I have met a lot of “RAC experts” in the last 5 years who lack an understanding of clustering principles, most notably what “fencing” is and what it means to RAC. Fencing is to clusters what critical sections are to SMP scalability.

Is it possible to be a “RAC expert” without being a cluster expert? The following is a digest of this paper about clusterware I have posted on the Oaktable Network website. For that matter, Julian Dyke and Steve Shaw accepted some of this information for inclusion in this RAC book.

Actually, I think getting it in their book was a part of the bribe for the technical review I did of the book (just joking).

I Adore RAC and Fencing is a Cool Sport!
No, not that kind of fencing. Fencing is a generic clustering term relating to how a cluster handles nodes that should no longer have access to shared resources such as shared disk. For example, if a node in the cluster has access to shared disk but has no functioning interconnects, it really no longer belongs in the cluster. There are several different types of fencing. The most common type came from academia and is referred to by the acronym STOMITH, which stands for Shoot The Other Machine In The Head. A more popular variant of this acronym is STONITH, where “N” stands for Node. While STONITH is a common term, there is nothing common about how it is implemented. The general idea is that the healthy nodes in the cluster are responsible for determining that an unhealthy node should no longer be in the cluster. Once such a determination is made, a healthy node takes action to power cycle the errant node. This can be done with network power switches, for example. All told, STONITH is a “good” approach to fencing because it is generally built upon the notion that healthy nodes monitor and take action to fence unhealthy nodes.
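To make the STONITH idea concrete, here is a minimal sketch in Python. It is a toy simulation, not any vendor's implementation: the `PowerSwitch` and `ClusterMonitor` classes, the node names, and the 5-second timeout are all assumptions invented for illustration. The only point it shows is that the fencing decision and the fencing action both live on a healthy node.

```python
from dataclasses import dataclass, field


@dataclass
class PowerSwitch:
    """Hypothetical stand-in for a network power switch."""
    cycled: list = field(default_factory=list)

    def power_cycle(self, node: str) -> None:
        # A real switch would cut and restore power; here we only record it.
        self.cycled.append(node)


@dataclass
class ClusterMonitor:
    """Runs on a healthy node and tracks the last heartbeat from each peer."""
    switch: PowerSwitch
    timeout: float = 5.0  # assumed value, in seconds
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def fence_stale_peers(self, now: float) -> list:
        # The healthy node decides which peers are unhealthy and fences them.
        stale = sorted(n for n, t in self.last_seen.items()
                       if now - t > self.timeout)
        for node in stale:
            self.switch.power_cycle(node)
            del self.last_seen[node]
        return stale


def demo_stonith() -> list:
    monitor = ClusterMonitor(PowerSwitch())
    monitor.heartbeat("node2", now=0.0)
    monitor.heartbeat("node3", now=0.0)
    monitor.heartbeat("node2", now=4.0)  # node3 has gone silent
    return monitor.fence_stale_peers(now=6.0)
```

Notice that nothing in this sketch requires the silent node to cooperate in its own fencing.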

The STONITH approach differs significantly from the “fencing” model implemented in Oracle Clusterware, which doesn’t implement STONITH at all. In Oracle Clusterware, nodes fence themselves by executing the reboot(8) command from the /etc/init.d/init.cssd script. This is a very portable approach to “fencing”, but it raises the question of what happens if the node is so unhealthy that it cannot successfully execute the reboot(8) command. Certainly we’ve all experienced systems that were so incapacitated that commands no longer executed (e.g., complete virtual memory depletion). In a cluster it is imperative that nodes be fenced when needed; otherwise they can corrupt data. After all, there is a reason the node is being fenced. Having a node with active I/O paths to shared storage after it is supposed to be fenced from the cluster is not a good thing.
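The weakness of self-fencing can be sketched with another toy model in Python. This is not Oracle's code and it never reboots anything: the `SickNode` class and its `free_process_slots` counter are assumptions made up for illustration, standing in for whatever resource the node has run out of.

```python
class SickNode:
    """Toy model of a node; free_process_slots is an assumed, simplified resource."""

    def __init__(self, name: str, free_process_slots: int) -> None:
        self.name = name
        self.free_process_slots = free_process_slots
        self.has_storage_access = True

    def spawn(self, command: str) -> str:
        # Stands in for fork/exec, which needs a free process slot.
        if self.free_process_slots <= 0:
            raise OSError("fork failed: no process slots")
        self.free_process_slots -= 1
        return command

    def self_fence(self) -> bool:
        """Self-fencing: the troubled node must run its own reboot command."""
        try:
            self.spawn("reboot")
        except OSError:
            return False  # missed fencing; the I/O paths stay live
        self.has_storage_access = False
        return True


def demo_self_fence() -> tuple:
    busy = SickNode("node1", free_process_slots=3)
    starved = SickNode("node2", free_process_slots=0)
    return (busy.self_fence(), busy.has_storage_access,
            starved.self_fence(), starved.has_storage_access)
```

In this model the starved node fails to fence itself and keeps its storage access, which is exactly the outcome fencing is supposed to rule out.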

Oracle Clusterware and Vendor Clusterware in Parallel
On all platforms except Linux and Windows, Oracle Clusterware can execute in an integrated fashion with the host clusterware. An example of this would be Oracle10g using the libskgx[n/p] libraries supplied by HP for the MC ServiceGuard environment. When Oracle runs with integrated vendor clusterware, Oracle makes calls to the vendor-supplied library to perform fencing operations. This blog post is about Linux, so the only relationship between vendor clusterware and Oracle Clusterware is when Oracle-validated compatible clusterware runs in parallel with Oracle Clusterware. One such example of this model is Oracle10g RAC on PolyServe Matrix Server.

In situations where Oracle’s fencing mechanism is not able to perform its fencing operation, the underlying validated host clusterware will fence the node, as is the case with PolyServe Matrix Server. It turns out that the criteria used by Oracle Clusterware to trigger fencing are the same criteria that host clusterware uses to take action. Oracle instituted the Vendor Clusterware Validation suites to ensure that underlying clusterware is compatible with and complements Oracle Clusterware. STONITH is one form of fencing, but far from the only one. PolyServe supports a sophisticated form of STONITH where the healthy nodes integrate with management interfaces such as Hewlett-Packard’s iLO (Integrated Lights-Out) and Dell DRAC. Here again, the most important principle of clustering is implemented: healthy nodes take action to fence unhealthy nodes, which ensures that the fencing will occur. This form of STONITH is more sophisticated than the network power-switch approach, but in the end they do the same thing: both approaches power-cycle unhealthy nodes. However, it is not always desirable to have an unhealthy server power-cycled just for the sake of fencing.

Fabric Fencing
With STONITH, there could be helpful state information lost in the power reset. Losing that information may make cluster troubleshooting quite difficult. Also, if the condition that triggered the fencing persists across reboots, a “reboot loop” can occur. For this reason, PolyServe implements Fabric Fencing as the preferred option for customers running Real Application Clusters. Fabric Fencing is implemented in the PolyServe SAN management layer. PolyServe certifies a comprehensive list of Fibre Channel switches that are tested with the Fabric Fencing code. All nodes in a PolyServe cluster have LAN connectivity to the Fibre Channel switches. With Fabric Fencing, healthy nodes make SNMP calls to the Fibre Channel switch to disable all SAN access from unhealthy nodes. This form of fencing is built upon the sound principle of having healthy servers fence unhealthy servers, but the fenced server is left in an “up” state, yet completely severed from shared disk access. Administrators can log into it, view logs and so on, but before the node can rejoin the cluster, it must be rebooted.
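Here is a minimal sketch of the Fabric Fencing idea in Python. It is a simulation only: the `FabricSwitch` class, the port numbers, and the node names are assumptions invented for illustration, and the method call stands in for the SNMP request a healthy node would send to a real switch. It is not PolyServe's code.

```python
class FabricSwitch:
    """Hypothetical model of a SAN switch that maps ports to nodes."""

    def __init__(self, port_map: dict) -> None:
        self.port_map = dict(port_map)              # port number -> node name
        self.enabled = {port: True for port in port_map}

    def disable_ports_for(self, node: str) -> list:
        # Called by a healthy node; severs every SAN path of the target node.
        ports = sorted(p for p, n in self.port_map.items() if n == node)
        for port in ports:
            self.enabled[port] = False
        return ports

    def can_reach_storage(self, node: str) -> bool:
        return any(self.enabled[p]
                   for p, n in self.port_map.items() if n == node)


def demo_fabric_fence() -> tuple:
    switch = FabricSwitch({1: "node1", 2: "node1", 3: "node2"})
    fenced_ports = switch.disable_ports_for("node1")  # issued by healthy node2
    return (fenced_ports,
            switch.can_reach_storage("node1"),
            switch.can_reach_storage("node2"))
```

The fenced node is never powered off in this model. It simply loses every path to shared disk, so it can still be logged into and examined.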

Kernel Mode Clusterware
The most important aspect of host clusterware, such as PolyServe, is that it is generally implemented in Kernel Mode. In the case of PolyServe, the most critical functionality of SAN management, cluster filesystem, volume manager and so on is implemented in Kernel Mode on both Linux and Windows. On the other hand, when fencing code is implemented in User Mode, there is always the risk that the code will not get processor cycles to execute. Indeed, with clusters in general, overly saturated nodes often need to be fenced because they are not responding to status requests by other nodes in the cluster. When nodes in the cluster are getting so saturated as to trigger fencing, having critical clusterware code execute in Kernel Mode provides a higher level of assurance that the fencing operation will succeed. That is, if all the nodes in the cluster are approaching a critical state and a fencing operation is necessary against an errant node, having Kernel Mode fencing architected as either robust STONITH or Fabric Fencing ensures the correct action will take place.

Coming Soon
What about SCSI-III Persistent Reservation? Isn’t I/O fencing as good as server fencing? No, it isn’t.

33 Responses to “RAC Expert or Clusters Expert?”


  1. 1 amit poddar December 4, 2006 at 12:06 am

    So there are two options

    1. Run Oracle Clusterware integrated with vendor clusterware
    2. Run Oracle Clusterware and Vendor clusterware in parallel

    The question is: are both of these options available on all platforms?

    amit

  2. 2 Kevinclosson December 4, 2006 at 12:58 am

    Hello Amit,

    Thanks for reading. The answer to your question is not straightforward (much to Oracle’s discredit actually). On all Legacy Unix platforms there is a wide range of integrated options (e.g., MC ServiceGuard for HP-UX, VCS or Sun Cluster for Solaris, HACMP for AIX, PolyServe Matrix Server and GFS for Linux, etc.).

    The best place to look is Metalink->Certify and chase down the RAC Technology Certification Matrix for the platform you are interested in.

    The main thing to remember is that RAC is purchased largely for fast response to systems failures. If you are maintaining a RAC deployment where there has never been any fault-injection testing performed, you might be sorely surprised some day when the iron takes a hit (for any reason).

  3. 3 Amir Hameed December 4, 2006 at 2:16 am

    Kevin,
    Per the following statement:
    “In situations where Oracle’s fencing mechanism is not able to perform its fencing operation, the underlying validated host clusterware will fence the node, as is the case with PolyServe Matrix Server.”
    I am curious to know how the third-party host clusterware detects that Oracle clusterware was not able to properly perform the fencing operation and that it should now take charge of the fencing operation?

  4. 4 kevinclosson December 4, 2006 at 6:00 pm

    Host clusterware doesn’t detect Oracle’s fencing was unsuccessful. Oracle’s fencing mechanism is a system reboot. Host clusterware doesn’t care if some application is trying to reboot a server.

    The key to understanding all this is to first look at why Oracle is trying to fence a server (with that reboot command), not to focus on why the fencing may not succeed (for such reasons as processor saturation or OS hangs or whatever). If Oracle is trying to fence a server that can’t see the SAN for instance, it can only hope to beat host clusterware to the punch as it were. That is, host clusterware is going to fence a server that can’t see the SAN regardless of what Oracle thinks about the situation. That is a good thing. That is fundamental clustering. It also makes Oracle’s attempt to execute the reboot command a moot point.

    There are a lot of common conditions where host and Oracle clusterware will fence servers. The point being that if Oracle can’t get the job done, the validated host clusterware will do it because there is a condition that would have caused a fencing operation regardless of whether Oracle was running on the node.

    There are those times when Oracle decides for fully internal reasons to fence a node. This would be a case where host clusterware would not be there to augment Oracle. That is, if Oracle determines for some internal reason to fence a server, but that server also for some reason fails to execute the reboot command (and that reason is also not a fencing condition from the host clusterware perspective), then what you have is a 100% Oracle failed fencing operation. That can happen on systems with or without vendor clusterware.

    Remember, we are talking about non-integrated (but validated) host clusterware. These are some of the reasons Oracle instituted the Clusterware Compatibility testing program.

  5. 5 Allan Nelson December 5, 2006 at 8:10 pm

    What are some good printed and web sources for learning about basic clustering theory and implementations?

  6. 6 amit poddar December 13, 2006 at 3:08 pm

    “The most important aspect of host clusterware, most notably PolyServe, is that it is generally implemented in Kernel Mode.”

    I was going through the init.css scripts in the etc directory. There I saw that Oracle’s clusterware is capable of invoking a fast reboot, i.e., on Linux it does reboot -n -f

    from reboot man page

    -n Don’t sync before reboot or halt
    -f Force halt or reboot, don’t call shutdown(8).

    The -h flag puts all harddisks in standby mode just before halt or
    poweroff. Right now this is only implemented for IDE drives. A side
    effect of putting the drive in standby mode is that the write cache on
    the disk is flushed. This is important for IDE drives, since the kernel
    doesn’t flush the write-cache itself before poweroff.

    Moreover we can even renice the crs/css processes to have a very high priority.

    Shouldn’t the above two take care of the drawbacks you mention, since reboot (having a very high priority) will get CPU time even when other processes are hung?

    thanks
    amit

  7. 7 kevinclosson December 13, 2006 at 5:20 pm

    Amit,

    No, absolutely not. What leads up to the reboot command? The script has to fork itself and then the kid has to make it through exec. Doing so requires process slots, file descriptors, VM, etc. To renice a process only increases the user mode scheduling priority. All runnable kernel mode processes will run first. It is quite easy to starve the crs scripts of CPU scheduling. Quite easy.
    And there is a gap between that point and when hangcheck kicks in.

    Now don’t get wrapped up in the Kernel versus User Mode point though. It doesn’t have much to do with fencing. When it comes to fencing, real STONITH and Fabric Fencing (PolyServe does both, others do STONITH) are architected such that the other nodes will take action to fence the troubled server. I only mention the Kernel mode aspect because it comes into play at other times as well (that don’t have to do with fencing).

  8. 8 amit poddar December 13, 2006 at 7:04 pm

    “It is quite easy to starve the crs scripts of CPU scheduling. Quite easy.”

    Can you come up with a scenario ?

    “And there is a gap between that point and when hangcheck kicks in.”

    I did not understand this statement.

    We are designing our hardware for RAC, and I want to be able to guide my bosses to make an informed decision. For doing that I need to understand this stuff myself. Hence the question.

    thanks for taking time to help me.

    amit

  9. 9 amit poddar December 13, 2006 at 7:18 pm

    “All runnable kernel mode processes will run first.”

    If we renice the crs process, its priority will be higher than the other Oracle processes (even with the CPU penalty in place, since Oracle processes would consume more CPU than the crs processes themselves).

    So where is the chance of corruption, since only Oracle processes accessing the database would cause corruption?

    I think what I don’t get is how corruption would take place even when, say, crs processes are starved of CPU, since the only processes which will run on the sick node will be kernel processes, which have nothing to do with the Oracle database?

    thanks

  10. 10 kevinclosson December 13, 2006 at 8:09 pm

    Amit,

    Please, I’m glad you are a reader, but think about what you are asking me. Let’s take the conversation up a notch. OK, how do I starve a reniced crs shell script process? Hmmm, do I have to list the nearly infinite reasons some user mode process can’t fork and exec a child process? Or if he can actually get through the spawn, will the child (reboot(8)) actually be able to open any files, allocate any memory, etc.? How about if he is swapped out and memory depletion is chronic? Ever seen that? Some user mode process that is reniced is not magically going to get even so much as his page tables swapped back in unless there is memory available. We are talking about clusterware dealing with pathological situations. Those are the ONLY situations that really matter where clusterware is concerned. Consider also a device driver bug where all processors are getting hammered with errant interrupts. Processors in interrupt handlers do not take time off to schedule in some user mode bash(1) process, reniced or not. This stuff happens. And these scenarios prevent Oracle from being able to fence itself. That is the principle I’m talking about. Consider a simple shell program with a simple loop that executes children without waiting (no “&” that is): boom, process table is full. How does the crs script fork and exec a process called reboot(8) if you can’t get a process slot?

    What you are coming up with are possible mitigating actions, but the fundamental problem will still exist. If a server has to FENCE ITSELF, then a missed fencing can occur. Simple. And “can” is an evil word to not brush aside with “remedies” when clusterware is concerned.

    Now, as for corruption, I haven’t been harping on that. There are a myriad of conditions where a missed fencing operation will not cause any database corruption. Other ill effects may be seen however, such as other nodes not being able to join the cluster, or whatever. The fact is if you pay for fencing, you should get robust fencing. If you were on a Legacy Unix system, you could choose from a variety of integrated host clusterware. Ask yourself why that is so. On Linux there is Oracle Validated Compatible Clusterware and there are a few to choose from. There is PolyServe and GFS on Intel Linux and IBM GPFS on Power Linux.

    If a server is supposed to fence itself, but cannot, it still has I/O paths, electricity, buffers to drain in the HBAs, and so on. It is a problem. A fundamental problem.

  11. 11 Kevinclosson December 13, 2006 at 8:46 pm

    “And there is a gap between that point and when hangcheck kicks in.”

    Did this confuse you because you don’t know what hangcheck is supposed to do on behalf of Oracle Clusterware?

  12. 12 amit poddar December 13, 2006 at 9:39 pm

    I thought that hangcheck was not required with 10g clusterware since resetting a node is done by crs itself.

  13. 13 kevinclosson December 13, 2006 at 10:09 pm

    Oh good heavens. Of course you need a hangcheck-timer. Now you see why I blog this topic. Hangcheck is the only prayer you have that a troubled server will not miss a fencing operation. It is crude, and doesn’t really do what it needs to do, but it is all you have.

    Not to single you out, Amit, but what I’m trying to inform people is that you can’t be a RAC expert without being a clusters expert.

    Read the following two FAQs:

    ML 232355.1 and 220970

  14. 14 amit poddar December 13, 2006 at 10:32 pm

    You should consider writing a book on clusters for people like me

    amit

  15. 15 kevinclosson December 13, 2006 at 11:04 pm

    Amit,

    It most certainly isn’t just you. If I had to bet money, I’d bet that less than 5% of the people that deploy RAC actually understand clusters. They generally get a crash course when things start to go wrong.

    Just look for my speaking engagements 🙂

    PS. Actually there is talk of a book, but don’t tell anyone.

  16. 16 Gooner December 14, 2006 at 1:47 am

    Kevin

    Whilst waiting for the book can you point us towards any good learning material?

  17. 17 kevinclosson December 14, 2006 at 1:58 am

    Wow, I get that one a lot. The knowledge needs to be drawn from so many areas and the technology is, frankly, boutique. Hard to find a to-the-point reference. You’ll get mash-jobs about academic shared-nothing clusters and all sorts of stuff that is not directly related to Oracle. Although the fencing models used in some of those have historically been cool. Unfortunately the worst place is Oracle doc and papers. They need to make clusters less scary for a volume play so they’ll never go through the mud in a paper. That’s too bad because RAC is astounding technology. It really does deserve the best clusterware it can run on.

    I know, how about…read the blog 🙂 Oh, and pass the word 🙂

    Better yet, get our software (before the free promotion ends if you can).

    https://kevinclosson.wordpress.com/misc/free-polyserve-software/

    OK, I know, that was SPAM 😦

  18. 18 Robin Harris January 16, 2007 at 6:20 pm

    Kevin,

    For an introduction to clusters and cluster implementation issues I found helpful, try VAXclusters: A Closely-Coupled Distributed System. I don’t think you can buy VAXclusters any more, so this isn’t a plug.

    IMHO, the article does a good job of laying out the issues in cluster design, and while it does spend time explaining how VAXcluster hardware works, it is also pretty readable.

    I haven’t seen a more modern article that combines technical rigor with similar readability. But I’m open to pointers.

    Robin

  19. 19 kevinclosson January 16, 2007 at 8:17 pm

    Hi Robin,

    Thanks for stopping by. You can plug VAX anytime on my blog :-)…we argue that the closest thing out there to VAX-style clustering is PolyServe. 100% distributed **AND** symmetric.

    And, of course, I have to point out that I love storagemojo.com

  20. 20 goran January 22, 2007 at 6:35 pm

    Hi Kevin,

    Thanks for a very good post.
    The key issue (in my view) is to prevent data corruption in the case when one of the nodes gets into an ‘undefined’ state.
    The possibility of data corruption exists regardless of whether we use Oracle clusterware or some other clusterware software; the point is to minimize the probability by choosing the clusterware with a robust ‘fencing system’.
    Since Oracle clusterware is ‘free’, I would say one gets what he pays for.

    Thanks,
    Goran

  21. 21 kevinclosson January 22, 2007 at 6:43 pm

    Goran,

    A very level-headed follow-up, Goran, thanks. And thanks for visiting my blog. I do wish to point out, however, that RAC costs about $60,000 per CPU so Oracle stock clusterware is anything but free.

  22. 22 Jared Still February 1, 2007 at 6:34 pm

    I am neither a RAC expert nor a Clustering expert.

    Thanks to this article however, I know a lot more than I did a few minutes ago.

    Thanks Kevin!

  23. 23 kevinclosson February 1, 2007 at 6:56 pm

    Jared,

    Thanks for stopping by!

  24. 24 sriram July 6, 2007 at 3:44 pm

    Hi ,

    Please can you comment on these observations.

    Please do correct me if i am understanding something wrong.

    Instance Eviction.

    This occurs when the members of the database cluster group are not able to communicate with each other. Causes may include the LMON process of one instance not being able to communicate with the LMON of another instance due to a communications error in the cluster, one of the RAC processes spinning, failure to issue a heartbeat to the control file due to instance death, or a split brain.
    If communication is lost at the cluster layer (for example, network cables are pulled), the cluster software may also perform node evictions in the event of a cluster split-brain. Oracle will detect a possible split-brain by a method known as Instance Membership Recovery, wherein the healthy instance will remove the problem instance(s) from the cluster.
    When the problem is detected, the instances ‘race’ to get a lock on the control file Results Record for updating. This is implemented in the same way that vendor clusterware takes a poll of the quorum disk to decide on the cluster owner. The instance that obtains the lock tallies the votes of the instances to decide membership and in turn waits for the cluster software to resolve the split-brain by calling the required fencing library, i.e., when using Oracle Clusterware it invokes OPROCD, and when vendor clusterware is present it calls the node membership service routine named libskgxn.so. If the cluster software does not resolve the split-brain within a specified interval, Oracle proceeds with instance evictions.
    But a potential catastrophe could arise if pending I/O from a dead instance is flushed to the I/O subsystem. This is where I/O fencing plays a crucial part. The I/O fencing is to be performed at the cluster layer and not at the RAC layer. When using Oracle Clusterware, Oracle relies on OPROCD, and the way it works is that the node reboots itself; once it has gone through the reboot process, it simply tries to rejoin the cluster, and the cluster can then decide whether to accept it or not.
    However, this method of rebooting the system still doesn’t prevent the pending I/O from being flushed to the I/O subsystem. It also raises the question of what happens if the node is so unhealthy that it cannot successfully execute the reboot command. Under such circumstances, where Oracle is not able to do the actual fencing operation, the vendor clusterware, if present, would do this with known technologies like SCSI reservations or Fabric Fencing rather than trying to kill the node.
    It is to be noted that a voting disk that is part of Oracle Clusterware is a backup communication mechanism that allows CSS daemons to negotiate which subcluster will survive. The usage of this can be appreciated when we consider more than two nodes, wherein it prefers the greater number of nodes to take over the cluster. Again, in this approach the pending I/O, if any, from the subcluster that is thrown out is not blocked.
    But having said this, the RAC locks are still not released even when there is a split brain, and Oracle strongly relies on these locks being reassigned only when the remote node is down. This may lead to a cluster hang, as there is a strong dependency between the Oracle Clusterware processes and the RAC background processes. On the other hand, if the vendor clusterware is present and Oracle makes a call to the fence service module, the clusterware does this operation independently of the RAC processes, leading to a smooth cleanup of the faulty nodes.
    This tells us that the fencing mechanism provided by Oracle Clusterware strongly relies on RAC-specific locks.

  25. 25 kevinclosson July 6, 2007 at 4:40 pm

    Hello Sriram,

    I have no idea where to start with a follow up. So, I’ll just take a stab. First, RAC locks have nothing at all to do with adjustments to node membership. CRS is entirely responsible for how servers come and go from the cluster and that is whether or not it is integrated with a vendor supplied libskgxn.so or the generic Oracle-supplied libskgxn.so. The latter being the only choice for Linux and Windows.

    Oracle does not need protection from I/Os in flight. If it did, there would be a requirement for I/O fencing on all ports of RAC. Thank heavens that isn’t the case, because SCSI Res is pretty feeble stuff.

    BTW, I/O fencing as you know it doesn’t actually guarantee protection from I/Os. The only way to prevent I/Os is to not issue them, or to sever the I/O path as is done when fabric fencing (switch-based fencing) is implemented.

    All that being said, I’m not sure if your post was a series of questions or statements.

  26. 26 sriram July 6, 2007 at 7:12 pm

    Hi Thanks for your response.

    This was a series of questions so that i can frame some statements in a more understandable manner.

    Thanks for the clarification.

    I am aware that RAC locks have nothing to do with node membership. One of the Metalink notes, the RAC FAQ, mentions that the voting disk along with the controlfile avoids data corruption, and this is ensured by releasing all the RAC-specific locks after the unhealthy node is rebooted so that any exclusive access to the resources held by either of the nodes is prevented.

    Also, the CRS process OCSSD closely interacts with the LMON process so that it can get the communication and node details from CSS.

    I was looking for clarifications on these points.

    Thanks again

  27. 27 Alex Gorbachev July 7, 2007 at 1:53 pm

    If I’m allowed to step in for a second…

    When a RAC instance (talking about an instance here, not a node) leaves the cluster, the rest of the instances go through a reconfiguration process. One of the instances, as part of reconfiguration, will clean up the locks from the dead instance as well as perform crash recovery. This process has nothing to do with node membership and node eviction. The same happens on instance shutdown abort.

  28. 28 kevinclosson July 7, 2007 at 4:11 pm

    Alex,

    Thanks for that. I find that in general, most folks think the database has to do with the cluster, and it doesn’t nearly as much as people think. In general, clusterware minds the cluster, the DLM minds the database. Back in Oracle 7 OPS days it was a little simpler to “see” that because Oracle spent significant portions of its time making ioctl()s to the kernel DLM and node membership infrastructure.

  29. 29 sriram July 10, 2007 at 2:05 pm

    Hi alex/kevin

    I had questions on the same after reading the RAC handbook by K.GOPALAKRISHNAN

    Below is an excerpt from it which talks about RAC and node membership and its interaction with the Clusterware.
    Please can you comment based on this.

    IMR is a part of the service offered by Cluster Group Services (CGS). LMON is the key process that handles many of the CGS functionalities. As you know, cluster software (known as Cluster Manager, or CM) can be a vendor-provided or Oracle-provided infrastructure tool. CM facilitates communication between all nodes of the cluster and provides information on the health of each node—the node state. It detects failures and manages the basic membership of nodes in the cluster. CM works at the cluster level and not at the database or instance level.

    Inside RAC, the Node Monitor (NM) provides information about nodes and their health by registering and communicating with the CM. NM services are provided by LMON. Node membership is represented as a bitmap in the GRD. A value of 0 denotes that a node is down and a value of 1 denotes that the node is up. There is no value to indicate a “transition” period such as during bootup or shutdown. LMON uses the global notification mechanism to let others know of a change in the node membership. Every time a node joins or leaves a cluster, this bitmap in the GRD has to be rebuilt and communicated to all registered members in the cluster.

    The details are there at

    http://searchsystemschannel.techtarget.com/general/0,295582,sid99_gci1254273,00.html

  30. 30 Alex Gorbachev July 10, 2007 at 11:30 pm

    Unless I’m seriously missing something, I think that instead of node, the term instance should have been used and by cluster it was meant Oracle RAC cluster and not CRS cluster itself.
    This would be less confusing with 9i but in 10g instance membership in RAC database cluster and node membership in CRS are two different things.


  1. 1 RAC e Cluster « Oracle and other Trackback on March 22, 2007 at 1:02 pm
  2. 2 DanNorris.com » Oracle Clusterware & Fencing Trackback on August 16, 2007 at 2:47 pm
  3. 3 Oracle Clusterware and Fencing…Again? « Kevin Closson’s Oracle Blog: Platform, Storage & Clustering Topics Related to Oracle Databases Trackback on August 17, 2007 at 6:50 pm

DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.
