General Purpose Clusterware | Kevin Closson's Blog: Platforms, Databases and Storage

In a recent post on the oracle-l list, a participant asked:

Hi, has anyone used 10.2 Clusterware with OCFS2 on RHEL5 to get single instance failover from one host to another?

My buddy Matt Zito (we’ve had beers before so we’re buddies) of GridApp followed up with:

I have a customer that does that – it apparently works very well […text deleted…]

However, the downside of CRS as single-instance is that both sides of the cluster need to be licensed for Oracle (as I understand the CRS license).

Licensing
Licensing is always a topic for interesting conversation. To get to the bottom of this, I sent an email to the first Oracle person I ever heard pitch the idea of CRS for non-RAC purposes: Marshall Presser. Hmm, I think I can call him my old buddy too since we also had beers. Or then again, if I’m not mistaken, Marshall is an old Pyramid Technology guy and since I am an old Sequent Computer Systems guy, we are sort of long-lost cousins. Anyway, back to the topic. Marshall was nice enough to send me a very current reference for Oracle’s licensing terms for using CRS for non-RAC purposes with a quote from Oracle® Database Licensing Information 11g Release 1 (11.1) Part Number B28287-01:

Oracle Clusterware can be installed and used to protect any Oracle or third-party software provided any of the following conditions are met:

1. The software being protected is from Oracle.

2. The software being protected uses an Oracle Database.

3. The software being protected is running on Oracle Unbreakable Linux.

4. The software being protected is running in a cluster where at least one machine involved in the cluster is licensed using the appropriate metric for either Oracle Database Enterprise Edition or Oracle Database Standard Edition. A cluster is defined to include all the machines that share the same Oracle Cluster Registry (OCR) and Voting Disk

Unclear Clarity
So, as is usually the case with licensing, we have unclear clarity. And, yes, I know this is 11g information and the original query was about 10g, but it stands to reason that with some digging there would be a 10g equivalent. I wonder why criterion 1 above is stated. Since only one criterion needs to be met, I suppose we can interpret them as follows:

You can use CRS on Unbreakable Linux for anything you want (rule 3)
You can protect non-RAC Oracle databases on any platform (rule 1)
You can protect any software that connects to an Oracle database on any platform (rule 2)
You can protect anything on any cluster as long as one node in the cluster is running an instance of EE or SE (rule 4)

These are pretty liberal rules. I think Oracle is keen on widespread adoption of Oracle Clusterware for general purpose HA, but then I could be misreading the tea leaves.

What Does This Really Mean?
What we’re talking about here is using CRS to monitor (“check” in CRS parlance, “probe” in generic industry terms) an instance of Oracle and take action if the check fails. In general failover HA terms, probes (checks in CRS terms) fail as follows:

1. The server is up but the database is down
2. The server is down

Failover
In case 1 above, the HA engine will restart the database and in case 2 it will fail the database over to another server. The HA engine (in this case CRS) is smart enough to fail the service over to a system that is actually alive and has functional disk access and network interfaces. That is one of the roles of any HA clusterware (e.g., CRS, Steel Eye, VCS, Service Guard, HACMP, Red Hat Cluster Suite, PolyServe, etc).

Time Outs
The other way the HA engine will take action is if your probe (check script/program) seizes (times out). In that case, most HA engines will execute a “restart” action, which is generally a stop action followed by a start action and another probe (check). This is not an endless loop though. Most HA engines have a tunable max for retries (restart attempts in CRS), and once that is exhausted the engine will fail over to the defined backup server. Be aware, however, that a seized service (such as a non-RAC database instance) could be so locked up that it didn’t stop when the HA engine tried its restart action. In that case, you have Oracle processes with files open. If you fail over to a server that accesses the database on a shared filesystem such as NFS or OCFS, you have some things to be concerned about. You won’t be able to start the instance until the $ORACLE_HOME/dbs/lk${ORACLE_SID} file is removed, but simply removing it still leaves that other catatonic instance up on the ill server. These solutions can become complex.
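To make the restart-then-failover policy concrete, here is a minimal sketch of the loop most HA engines run. It is not how CRS itself is implemented. MAX_RESTARTS, run_check, run_restart and supervise are hypothetical names I made up for illustration, and the stand-in functions simply pretend the service stays ill.

```shell
#!/bin/sh
# Sketch of a restart-then-failover policy. All names here are hypothetical;
# none of them are CRS parameters or commands.

MAX_RESTARTS=${MAX_RESTARTS:-3}   # tunable cap on restart attempts

# Stand-ins for the real probe and restart actions; each returns 0 on success.
run_check()   { return 1; }       # pretend the service is ill
run_restart() { return 0; }       # pretend stop + start ran without error

# Prints "healthy" if a check eventually passes, or "failover" once the
# restart budget is used up and the check is still failing.
supervise() {
    attempts=0
    while ! run_check; do
        if [ "$attempts" -ge "$MAX_RESTARTS" ]; then
            echo "failover"
            return 1
        fi
        attempts=$((attempts + 1))
        run_restart               # stop, then start, then loop back to the check
    done
    echo "healthy"
    return 0
}
```

With the stand-ins as written, calling supervise attempts three restarts and then reports failover. Note that the sketch does nothing about the catatonic-instance problem described above; a real engine also has to be sure the old instance is truly gone before starting a new one elsewhere.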

The topic of what probe (check) actions are appropriate is the subject matter of very long, drawn-out discussions rife with theory and prejudice. I’ve been there and I’ve done that. I bet most folks that use CRS to start/stop and check non-RAC databases will likely use the script interface. Note, as with all HA engines out there, you can write a C probe (or CRS action program) because all the engine is looking for is a return code (success/failure).
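For the script interface, the shape is roughly the following. This is only a skeleton under my own assumptions: the engine calls the script with a single entry point name (start, stop or check) and looks at nothing but the return code, zero for success and non-zero for failure. The function names start_db, stop_db, check_db and crs_action are hypothetical placeholders, and the bodies are just echo statements where your real commands would go.

```shell
#!/bin/sh
# Skeleton of a script-style action program: one argument in, one return
# code out. All function names are hypothetical placeholders.

start_db() { echo "starting instance"; }   # real version: start the instance
stop_db()  { echo "stopping instance"; }   # real version: shut the instance down
check_db() { echo "probing instance"; }    # real version: the probe of your choice

# Dispatch on the entry point name. The return code of the chosen function
# is passed straight through, since that is all the engine looks at.
crs_action() {
    case "$1" in
        start) start_db ;;
        stop)  stop_db ;;
        check) check_db ;;
        *)     echo "usage: $0 {start|stop|check}" >&2
               return 1 ;;
    esac
}
```

A real script would finish with something like crs_action "$1" followed by exit $? so the engine receives the result as the process exit status.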

I think the most clever probe action I’ve heard to date came from fellow OakTable Network member Tim Gorman. Tim once suggested that a great probe action would be to make a purposeful failed attempt to connect such as:

$ sqlplus foo/bar <<EOF 2>&1 | grep 1017
> REM There is no user called foo...expect ORA 1017
> exit;
> EOF
ORA-01017: invalid username/password; logon denied
$ echo $?
0

If you get anything other than ORA-01017, something is ill. In this case, a success for grep(1) is a success for the probe/check. That is, if grep(1) gets its text, the server returned ORA-01017, thus the instance was well enough to perform the functionality of user authentication. Your check script would get this in grep(1)’s return code ($?).
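Wrapped up as a function for a check script, Tim's probe might look like the sketch below. The SQLPLUS variable and the function name check_bogus_login are my own hypothetical additions (the variable only exists so the probe can be pointed at a stub for testing). I also added the SQL*Plus -L option, which attempts the logon only once instead of reprompting; drop it if your version behaves differently.

```shell
#!/bin/sh
# Sketch of a check built on the bogus-login probe.
# SQLPLUS is a hypothetical override hook, not an Oracle setting.

SQLPLUS=${SQLPLUS:-sqlplus}

# Returns 0 only if the server answers the bogus login with ORA-01017.
# The function's return code is grep's return code, the last command
# in the pipeline.
check_bogus_login() {
    "$SQLPLUS" -L foo/bar <<EOF 2>&1 | grep 'ORA-01017' >/dev/null
REM There is no user called foo...expect ORA 1017
exit;
EOF
}
```

A check entry point can then simply call check_bogus_login and hand its return code back to the HA engine.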

Trying to connect as a bogus user actually tests quite a bit of server functionality (SQL parsing, user authentication and so forth). I think this may actually create a temporary session as well. It certainly tests the server’s ability to fork(2) sqlplus and exec(2) $ORACLE_HOME/bin/oracle so you are testing the OS VM, process slots, etc. All in all, it is a very clever probe (check action). If you wanted to use CRS to check both the health of SQL*Net and a non-RAC database instance, then you could do this same bogus connect attempt through the listener. If the listener is down, you’ll get the appropriate error text. Then again, if you wanted to make a heavy probe/check, you could connect as an application user and update a dummy row in a table or something like that. The sky is the limit with this sort of HA kit.
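If you run the bogus connect through the listener, the check can tell a dead listener apart from an ill instance by looking at which error text came back. The following is a minimal sketch that classifies the probe output it reads on stdin. The function name classify_probe and the three status words are hypothetical; ORA-12541 is the standard "TNS:no listener" error, and anything unrecognized is treated as ill.

```shell
#!/bin/sh
# Sketch: classify the text returned by a bogus login made through the
# listener. The function name and status words are hypothetical.

classify_probe() {
    output=$(cat)                 # read the whole probe output from stdin
    case "$output" in
        *ORA-01017*) echo "healthy" ;;        # listener and instance both answered
        *ORA-12541*) echo "listener-down" ;;  # TNS:no listener
        *)           echo "ill" ;;            # anything else is suspect
    esac
}
```

You would feed it the output of something like sqlplus -L foo/bar@MYALIAS, where MYALIAS is a placeholder for whatever net service name you use, and map the three answers to whatever return codes and actions suit your setup.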

Additional Material
Oracle has more information in the form of whitepapers:


Using Oracle Clusterware for Non-RAC Purposes
