The “Dread Factor”, Multi-vendor Support, Unbreakable Linux.

Dread the Possible, Ignore the Probable
“One throat to choke” is the phrase I heard the last time I spoke with someone who went to extremes to reduce the number of technology providers in their production Oracle deployment. You know, Unbreakable Linux, single-source support provider, etc. I’m sorry, but if you are running Oracle on Linux there is no way to get single-provider support. We all find this out sooner or later. Sure, you can send your money to a sole entity, but that is just a placebo. If I thought my life depended on single-provider support, I’d buy an IBM System i solution (AS/400), soup to nuts. At least I’d get close.

With Linux there are always going to be multiple providers because it runs on commodity hardware. You then add storage (SAN array, switches, HBAs), load the OS and Oracle and other software. There you go: multiple providers. So why is it that people sometimes take comfort in this theory of single-provider support on the software (OS and Oracle only, of course) side of things? Is it a reality?

Dread Factor
No, single-provider support with Oracle on Linux is not a reality. That is why serious software providers and their careful customers rely on TSANet to ensure all parties play by the rules and do not start pointing fingers at the expense of the customer. Oracle is a participant in TSANet, and so is PolyServe.

I was reading an interesting magazine article—also available online—about how we humans fear the wrong things. You know, things like fearing a commercial airliner fatality more than an auto fatality—the latter taking 500-fold more lives per year. The article explains why. We dread an airliner crash more. The article points out:

[…] the more we dread, the more anxious we get, and the more anxious we get, the less precisely we calculate the odds of the thing actually happening. “It’s called probability neglect,”

What Does This Have To Do With Oracle?
Well, we fear how “helpless” we might be in a case where the OS or third-party platform software provider is pointing at Oracle and Oracle is pointing back. By the way, have you ever finger-pointed at an 800lb gorilla? Yes, that is a possible scenario. Is that somehow more calamitous than working with Oracle on a clear, concise Oracle-only bug (e.g., some ORA-00600 crash problem)? Probably not, but fear of the former is an example of what the magazine article calls the Dread Factor.

New Year’s Resolution: Fear the Probable
We have a Wall Street customer that does not run Oracle on our Database Utility for Oracle RAC in their RAC solution but does use our scalable file serving in their ETL workflow. They run Oracle on Itanium Linux and we don’t do Itanium. But, since we are in there, I know a bit about their operations. In the month of November 2006, one of their operations managers told me they had nearly 90 Oracle TARs open, half of which were ORA-00600/ORA-07445 problems. All those TARs were affecting a single application, a single RAC database. Yes, it is conceivable that they have also faced a multi-vendor problem (e.g., HBA firmware/Red Hat SCSI midlayer) at some point in this deployment. Do you think they really care? In this shop, the database tier is 100% Unbreakable Linux (the old style, not the new style). The old style Unbreakable Linux is RHEL with Oracle and no third-party kernel loadable modules. That’s them: they have a “single throat to choke”. How do you think that is working out for them? It hasn’t made a bit of difference.

Oracle is an awesome database. It is huge and complex. You are going to hit bugs, so it might be a good New Year’s resolution to fear the probable more than the possible. Get the most stable, manageable, supported configuration you can so that you are not dealing with day-to-day headaches between those probable bugs. That is, don’t hinge your deployment on some possible support finger-pointing match. Real, difficult, single-vendor bugs are most probable. Choose your partners well for those possible bugs.

A Case Study
The majority of the suse-oracle email list participants have the “no-third-party” model deployed. They are, if you will, the poster children for Unbreakable Linux. So I keep an eye out there to see how the theory plays out in reality. Let’s take a peek. In a recent thread about an Asynchronous I/O problem in the Linux kernel, the poster wrote:

We already tried this…opened a TAR with Oracle, opened an issue with Novell…got 2 fixes from Novell, but both are not helping around the bug. The database crashes after approx. 1 week of heavy load and you have to restart the machine to free the ipc-resources.

Remember that with an Unbreakable Linux deployment, if you hit a Linux kernel problem you can call Oracle or the provider of your Linux distribution. This person tried both, but the saga continued:

[…] we filed a bug…with both parties, Novell AND Oracle.We escalated this case at Novell, because it’s a kernel bug…no change for the last 4-6 weeks. But…as you see…no solution after about 3 months…

Since Linux is open source, the code is open to all for reading. I’ve blogged before about the dubious value in being able to read the source for the OS or layers such as clustered filesystems since an IT shop is not likely to fix the problem themselves anyway. The customer having this async I/O problem took advantage of that “benefit”:

I took a deep look into the kernel-code, especially the part of the bug in aio.c As far as i see, it looks like a list-corruption of the list of outstanding io-requests. So i don’t think that it is driver-specific…it looks like a general bug.

But, as I routinely point out, having the source really doesn’t help an IT shop much, as this installment on the thread shows:

It’s very unfortunate that this bug (bz #165140) is still not resolved as both Oracle and SUSE eng. teams are looking into problem.

An Historical Example of Good Multi-Vendor Support
Back in the 1990s, Veritas, Oracle and Sun got together to build a program called VOS to ensure their joint customers got the handling they deserve. Kudos to Oracle and Sun. That was typical of Oracle back in the Open Systems days. Things were a lot more “open” back then.

I participate in the oracle-l list. There was a recent thread there about the dreaded “finger-pointing” illusion. In this post a list participant set the record straight. His post points out that having more than “one throat to choke” is better than being all alone:

In the context of clustering, even if you eliminate the third-party cluster-ware products, you still have the other pieces of the pie, like the OS, the storage (SAN, etc.), the interconnect, etc., so the finger-pointing will not go away. I have worked with the VOS support many times in the past and I can tell you that in each conference call, VERITAS support never pointed fingers towards anyone. In fact, their support people were so competent that they even identified issues that were related to SAN and even the analysts from the storage SAN company were not able to identify them.

Lessons From Real Life
Multi-vendor support is a phenomenon across all industries. A good friend of mine has a real job and does real work for a living—dangerous work, with huge dangerous equipment that he owns. He knows that there are certain things he has to do with his machinery that substantially increase the probability of something going wrong. In those cases, he doesn’t fret about the possibility that there may be some political outcome. He focuses on the probable.

A bit over a year ago he experienced “the probable” and took photos for me. While moving a 60,000+ lb piece of machinery, he hit a patch of ice and yes, 30-ton track vehicles do slide on ice just like your co-worker’s red sports car.

In the following shot, the machinery had just slipped off the road so he called in another of his pieces to help.

In the next shot they had worked at the problem until the tracks were headed in the right direction and the tether was freshly cut loose. He said the anxiety was so thick you could cut it with a knife. It is quite probable he is right. Then again, it is possible he was exaggerating. I’ll let you be the judge.

I’ll blog another time about where that machine had to go after that photo…it wasn’t pretty.

Kevin Closson's Blog: Platforms, Databases and Storage