In Alejandro Vargas’ blog entry about RAC & ASM, Crash and Recovery Test Scenarios, some tests were described that would cause RAC failovers. Unfortunately, none of the faults described were of the sort that truly put clusterware to the test. The easiest failures for clusterware to handle are complete, clean outages. Simply powering off a server, for instance, is no challenge for any clusterware to deal with; the other nodes in the cluster will be well aware that the node is dead. The difficult scenarios for clusterware to respond to are states of flux and compromised participation in the cluster, that is, a server that is alive but not participating. Alejandro’s blog entry was by no means meant to define a production readiness testing plan, but it was a good segue into the comment I entered:
These are good tests, yes, but they do not truly replicate difficult scenarios for clusterware to resolve. It is always important to perform manual fault-injection testing such as physically severing storage and network connectivity paths and doing so with simultaneous failures and cascading failures alike. Also, another good test to [run] is a forced processor starvation situation by forking processes in a loop until there are no [process] slots [remaining]. These […] situations are a challenge to any clusterware offering.
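For what it’s worth, here is a minimal bash sketch of the process-slot starvation test mentioned in that comment. The script name and the sleep interval are my own illustrative assumptions; the point is simply to fork children until the kernel refuses to create more.

```bash
#!/bin/bash
# fork_starve.sh -- hypothetical sketch of a process-slot starvation test.
# Forks sleeping children in a loop until the kernel refuses new processes;
# bash itself will start reporting fork failures on stderr at that point.
# WARNING: run this only on a disposable pre-production node.

ulimit -u unlimited 2>/dev/null  # best effort: lift the per-user process cap

while true; do
    sleep 3600 &   # each child pins a process-table slot with no CPU load
done
```

Once the process table is full, watch from another node: can the clusterware daemons still fork what they need, do heartbeats keep flowing, and if not, does fencing behave sanely?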
Clusterware is Serious Business
As I pointed out in my previous blog entry about Oracle Clusterware, processor saturation is a bad thing for Oracle Clusterware—particularly where fencing is concerned. Alejandro had this to say:
These scenarios were defined to train a group of DBAs to perform recovery rather than to test the clusterware itself. When we introduced RAC & ASM we did run stress and resilience tests. The starvation test you suggest is a good one; I have seen that happen at customer sites in production environments. Thanks for your comments.
Be Mean!
If you are involved with a pre-production testing effort involving clustered Oracle, remember: be evil! Don’t force failover by doing operational things like shutting down a server or killing Oracle Clusterware processes; that is merely a functional test. Instead, create significant server load with synthetic tests such as wild loops of dd(1) to /dev/null using absurdly large values assigned to the ibs argument, or shell scripts that fork children but don’t wait for them. Run C programs that wildly malloc(2) memory, or maybe a little stack recursion is your flavor; force the system into swapping, and so on. Generate these loads on the server you are about to isolate from the network, for instance, then see what the state of the cluster is afterwards. Of course, you can also purposefully execute poorly tuned Parallel Query workloads to swamp a system. Be creative.
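To make that concrete, here is a minimal bash sketch of the sort of load generator described above, combining dd(1) loops with an absurdly large ibs and a memory hog that drives the node into swap. The script name, sizes, and duration are illustrative assumptions, not a recipe from any vendor kit.

```bash
#!/bin/bash
# load_node.sh -- hypothetical synthetic load generator for cluster testing.
# Usage: ./load_node.sh [seconds]
# WARNING: run this only on a disposable pre-production node.

DURATION=${1:-300}   # seconds of sustained load; default is five minutes

# CPU and memory-bandwidth pressure: one dd(1) read loop per processor,
# each using an oversized input buffer.
CPUS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)
for ((i = 0; i < CPUS; i++)); do
    ( while true; do
          dd if=/dev/zero of=/dev/null ibs=512M count=4 2>/dev/null
      done ) &
done

# Memory pressure: keep doubling a string until allocation fails or the
# node is deep into swap.
( x=A
  while true; do
      x="$x$x"
  done ) 2>/dev/null &

# Let the load run, then attempt a best-effort cleanup. On a node this
# stressed, a reboot is the only truly reliable recovery.
sleep "$DURATION"
pkill -P $$ 2>/dev/null
```

Start a script like this, then sever the network or storage path, and see whether the surviving nodes agree on the state of the cluster.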
Something To Think About
For once, it will pay off to be evil. Just make sure whatever you adopt as your synthetic load generator is consistent and reproducible, because once you start this testing you’ll be doing it again and again if you find bugs. You’ll be spending a lot of time on the phone making new friends.
regarding your comment about some people being crazy about open source products: generalizing that “they” look in neither the bug database nor the source code is just lame. first, there is no “they” out there. second, it seems you have some misconception about open source.
the point is that you _can_ look in bug database and source code when _you_ want to, so you get some additional freedom in choice with open source. open source software certainly isn’t “better” than proprietary software just because it is open source, it rather depends on your needs.
so please correct the prejudice you display here, it just doesn’t fit your engineering views. personally i prefer to see all bugs and make an informed decision.
lhe,
I have personally spoken to people who have deployed open source products without ever having read the bug database. I maintain that doing so is, without a doubt, completely crazy. That is my viewpoint, and thus I share it in *my* blog. I should hope that people can agree to disagree with me on that view if they need to; I agree to disagree with people all the time. At least it is on readers’ minds now.
Taking the thought further, I also think it is crazy not to spend a reasonable amount of time reading Oracle bugs in Metalink. As they say, “An ounce of prevention is worth a pound of cure.”
Finally, I do get your point about the fact that one _can_ read source and bugs for open source goodies when one wants to. I should think a **really** good time to do so is **before** one crams the stuff into a solution.
Thanks for reading.
Seems I misunderstood your posting. I read the paragraph about open-source crazy people as if you were talking about _anyone_. But your answer makes it clear that you had a significantly smaller group of people in mind. So I got you wrong and thought of you as another one of those FUD types. Sorry for that.
lhe,
No problem. I appreciate your participation in the blog. I had a feeling you thought I was fingering all adopters of open source as crazy and that couldn’t be further from my view! On the other hand, where open source is concerned, I am totally against three mentalities:
1. The Belligerently Ignorant
2. The Irrationally Exuberant
3. The Self-Indulgently Dishonest
I see #3 preying on #1. When that happens, and doesn’t work out, #1 seeks emotional comfort from the #2 types.
I am a proponent of using any particular technology **if** it solves a problem. If it doesn’t, and some #3 type tries to tell me it does, I get very upset.
How badly would I have to hate myself to work so hard for over 5 years for a company that provides platform software technology in the Linux environment if I were indeed completely against open source just because it is open source? That would be pitiful.
Starvation testing is critical. The problem of starvation doesn’t manifest itself until the system is under heavy load, the one time you don’t want it to fail. Thanks for enlightening people! Great job as usual!
Thanks for the kind words, Mike. You did notice that this is a six-year-old blog entry, right? 🙂
The absolutely, positively most difficult thing for clusterware to get right is handling nodes that are transitioning in and out of the cluster because of “fizzling” components, such as a node popping onto the network and back off before fully transitioning into cluster membership. Race conditions are tough. The totally rewritten Oracle Clusterware in 11.2 probably comes the closest to getting it right. All prior Oracle clusterware was quite weak; that’s why they had to totally rewrite it in a “dot two” release.