Partition, or Real Application Clusters Will Not Work.

OK, that was a come-on title. I’ll admit it straight away. You might find this post interesting nonetheless. Some time back, Christo Kutrovsky made a blog entry on the Pythian site about buffer cache analysis for RAC. I meant to blog about the post, but never got around to it—until today.

Christo’s entry consisted of some RAC theory and a buffer cache contents SQL query. I admit I have not yet tested his script against any of my RAC databases. I intend to do so soon, but I can’t right now because they are all under test. However, I wanted to comment a bit on Christo’s take on RAC theory. But first I’d like to address a statement in Christo’s post. He wrote:

There’s a caveat however. You have to first put your application in RAC, then the query can tell you how well it runs.

Not that Christo is saying so, but please don’t get into the habit of using scripts against internal performance tables as a metric of how “well” things are running. Such scripts should be used as tools to approach a known performance problem—a problem measured much closer to the user of the application. There are too many DBAs out there that run scripts way down-wind of the application and if they see such metrics as high hit ratios in cache, or other such metrics they rest on their laurels. That is bad mojo. It is entirely possible that even a script like Christo’s could give a very “bad reading” while application performance is satisfactory, and vice versa. OK, enough said.
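When I do want a quick, system-wide look at a RAC problem I already know exists, I start from wait interface data rather than ratios. Here is a minimal sketch of that sort of query, assuming 10g-style global cache event names (‘gc …’); on 9i the events were named ‘global cache …’, so treat the LIKE filters as placeholders for your release:

-- Rough sketch only: rank interconnect-related waits per instance.
-- Event names assume 10g ('gc ...'); adjust the patterns for 9i ('global cache ...').
select inst_id,
       event,
       total_waits,
       round(time_waited_micro / 1000000, 1) as seconds_waited
  from gv$system_event
 where event like 'gc cr%'
    or event like 'gc current%'
    or event like 'gc buffer%'
    or event like 'global cache%'
 order by time_waited_micro desc;

Even then, the numbers only mean something when tied to a response-time complaint measured at the application.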

Application Partitioning with RAC
The basic premise Christo was trying to get across is that RAC works best when applications accessing the instances are partitioned in such a way as to not require cross-instance data shipping. Of course that is true, but what lengths do you really have to go to in order to get your money’s worth out of RAC? That is, we all recall how horrible block pings were with OPS—or do we? See, most people who loathed the dreaded block ping in OPS thought the poison was in the disk I/O component of a ping, when in reality the poison was in the IPC (both inter- and intra-instance IPC). OK, what am I talking about? It was quite common for a block ping in OPS to take on the order of 200-250 milliseconds on a system where disk I/O was being serviced with respectable times of around 10ms. Where did the time go? IPC.

Remembering the Ping
In OPS, when a shadow process needed a block from another instance, there was an astounding amount of IPC involved to get the block from one instance to the other. In quick and dirty terms (this is just a brief overview of the life of a block ping), it consisted of the shadow process requesting the local LCK process to communicate with the remote LCK process, which in turn communicated with the DBWR process on that node. That DBWR process then flushed the required block (along with all the modified blocks covered by the same PCM lock) to disk. That DBWR then posted its local LCK, which in turn posted the LCK process back on the node where the original requesting shadow process was waiting. That LCK then posted the shadow process, and the shadow process finally read the block from disk. Whew. Note that at every IPC point the act of messaging only makes the posted process runnable; it then waits in line for CPU in accordance with its mode and priority. Also, when DBWR was posted on the holding node, it was unlikely to have been idle, so the life of the block ping also included time spent while DBWR finished the SGA flushing it was already doing when it got posted. All told, there were quite often some 20 points where the processes involved were in runnable states. Considering the scheduling time quantum is/was 10ms, you routinely got as much as 200ms of overhead on a block ping that was nothing but scheduling delay. What a drag.
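Back-of-the-envelope, using the numbers above (roughly 20 points where a process sits runnable, a 10ms scheduling quantum, and a disk read of about 10ms), the damage works out to something like:

\[
t_{\mathrm{ping}} \approx N_{\mathrm{runnable}} \times t_{\mathrm{quantum}} + t_{\mathrm{disk}} \approx 20 \times 10\,\mathrm{ms} + 10\,\mathrm{ms} \approx 210\,\mathrm{ms}
\]

which is how a ping landed in the 200-250ms range on a system whose disks were perfectly healthy.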

What Does This Have To Do With RAC?
Christo’s post discusses divide-and-conquer-style RAC partitioning, and he is right. If you want RAC to perform perfectly for you, you have to make sure that RAC isn’t being used. Oh, he’s gone off the deep end again, you say. No, not really. What I’m saying is that if you completely partition your workload, then RAC is indeed not really being used. I’m not saying Christo is suggesting you have to do that. I am saying, however, that you don’t have to do that. This blog post is not just a shill for Cache Fusion, but folks, we are not talking about block pings here. Cache Fusion—even over Gigabit Ethernet—is actually quite efficient. Applications can scale fairly well with RAC without going to extreme partitioning efforts. I think the best message is that application partitioning should be looked at as a method of exploiting this exorbitantly priced stuff you bought. That is, in the same way we try to exploit the efficiencies gained by fundamental SMP cache-affinity principles, so should attempts be made to localize demand for tables and indexes (and other objects) to instances—when feasible. If it is not feasible to do any application partitioning, and RAC isn’t scaling for you, you have to get a bigger SMP. Sorry. How often do I see that? Strangely, not that often. Why?
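If you want a starting point for that sort of localization work, the data dictionary can tell you which objects are generating the most interconnect traffic and from which instance. Here is a hedged sketch, assuming 10g where GV$SEGMENT_STATISTICS carries global cache statistics (Christo’s script attacks the same question from the buffer cache side):

-- Rough sketch only: which segments ship the most blocks across the interconnect,
-- broken out by instance. The same objects showing high counts from multiple
-- instances are candidates for localizing to one instance (when feasible).
select inst_id,
       owner,
       object_name,
       statistic_name,
       value
  from gv$segment_statistics
 where statistic_name in ('gc cr blocks received', 'gc current blocks received')
   and value > 0
 order by value desc;

Again, that is a tool for chasing a known problem, not a score to optimize.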

Over-configuring
I can’t count how often I see production RAC instances running throughout an entire RAC cluster at processor utilization levels well below 50%. And I’m talking about RAC deployments where no attempt has been made to partition the application. These sites often don’t need to consider such deployment tactics because the performance they are getting is meeting their requirements. I do cringe and bite my tongue, however, when I see two instances of RAC in a two-node cluster—void of any application partitioning—running at, say, 40% processor utilization on each node. If no partitioning effort has been made, that means there is Cache Fusion (GCS/GES) in play—and lots of it. Deployments like that are turning their GbE Cache Fusion interconnect into an extension of the system bus, if you will. If I were the administrator of such a setup, I’d ask Santa to scramble down the chimney and pack that entire workload into one server at roughly 80% utilization. But that’s just me. Oh, actually, packing two 40% RAC workloads back into a single server doesn’t necessarily produce 80% utilization. There is more to it than that. I’ll see if I can blog about that one too at some point.

What about High-Speed, Low-Latency Interconnects?
With OLTP, if the processors are saturated on the RAC instances you are trying to scale, a high-speed, low-latency interconnect will not buy you a thing. Sorry. I’ll blog about why in another post.

Final Thought
If you are one of the few out there that find yourself facing a total partitioning exercise with RAC, why not deploy a larger SMP instead? Comments?

19 Responses to “Partition, or Real Application Clusters Will Not Work.”


  1. 1 Alex Gorbachev December 17, 2006 at 9:26 pm

    “Oh, actually, packing two 40% RAC workloads back into a single server doesn’t necessarily produce 80% utilization.”

    On one site I did measurements of CPU consumption for RAC with a Primary/Secondary configuration (using max_instance_count=1) versus RAC with a distributed workload when both nodes are active. Two instances only – big SMP machines. Real production traffic, not a simulated application. The result was quite interesting – a 70% increase with both nodes active. We thought it might be due to the expensive IP stack for the interconnect – we tried Veritas LLT – same picture. The LM* processes were especially busy. There was close collaboration with the Oracle RAC development group to explain this and, without disclosing the results, I can say that no common denominator was reached. This was 9i and no special application optimization. By the way, there was no dramatic increase in response time, but nobody was prepared to start paying 70% more for Oracle licenses and big SMP machines without a significant improvement in availability.

    “If you are one of the few out there that find yourself facing a total partitioning exercise with RAC, why not deploy a larger SMP instead?”

    Well, how long does reboot of big SMP machine take? Now what about small Linux box?

  2. 2 kevinclosson December 18, 2006 at 12:33 am

    “Well, how long does reboot of big SMP machine take? Now what about small Linux box?”

    …yes, big SMPs take a while to reboot…and I know where you’re going… I’ll put a question right back at you: how long does it take a surviving RAC instance to go completely green after handling all the cleanup from a freshly deceased instance? Never forget the service brown-out that should be expected when a surviving instance is toiling with cleanup from the dead instance… I think the main concern is the user’s experience. If the browser tier and app tier are able to keep the user connected while waiting for an instance to come online (be it RAC or failover HA), then it all comes down to MTBF… it amazes me how many developers have never heard of the SQLCA struct 🙂 Just reconnect!

    Hold it, Alex, you’re not one of the confused folks that think RAC can failover an INSERT/UPDATE/DELETE statement, right ? Nah, I know you’re not…I’ve had beers with you so I know you better than that 🙂

  3. 3 Alexander Fatkulin December 18, 2006 at 1:35 am

    “If you are one of the few out there that find yourself facing a total partitioning exercise with RAC, why not deploy a larger SMP instead?”

    Because those guys discovered that they can no longer do that? 🙂

    At some sites we already have fully loaded HP Integrity Superdomes (one cabinet with 32 dual-core Itanium 2 CPUs, 64GB RAM) that are still not able to handle the required workload.

    BTW, a few months ago I saw two fully loaded HP Superdomes RAC’ed together running the same stuff – horrible.

    That problem has nothing to do with hardware, RAC or such stuff. Extremely bad applications can eat whatever you buy. So I believe “one of the few out there that find yourself facing a total partitioning exercise” already designed something that can’t scale. After realizing that RAC generally only makes unscalable stuff even more unscalable 🙂 – they end up doing a complete application rewrite.

  4. 4 Alex Gorbachev December 18, 2006 at 5:02 am

    “Hold it, Alex, you’re not one of the confused folks that think RAC can failover an INSERT/UPDATE/DELETE statement, right ? Nah, I know you’re not…I’ve had beers with you so I know you better than that”

    😀
    Well, you asked for comment. We’ve got some! 😉
    Obviously, there is no free cheese with RAC, as there are practically no applications doing only read requests (except data warehouses but hey what does HA for a data warehouse mean?). Applications, obviously, must be able to handle “recovery” on disconnect, and neither TAF nor FAN/FCF or whatever they call it these days is able to take over this burden.

    By the way, moving away from the Primary/Secondary configuration I mentioned was driven by an attempt to reduce that brownout time and, to add detail to my message above – it wasn’t significantly improved. For example, cluster reconfiguration time dropped from about 6-9 seconds to 3-4 seconds IIRC. Instance crash recovery dropped somewhat, but this is where the MTTR target comes into play.

    To be fair, it’s possible to reduce brownout time to just seconds and I saw this “technically” working (not in production though) while taking part in one project. The results were actually presented at OOW06, but these are rather lab tests at this stage.

    Btw, were you also present at the Guinness spill-over action? It really took a while waiting for a cold failover – a few minutes to get another pint! 😉 But that’s for me – for Babette it took hours to do the laundry that night.
    Imagine if I had a cluster of two Guinnesses? I could even drink them both at once to distribute the workload. 😉 On the other hand, if they both spill – it might cause more people to go to the laundry… So what about 3 or 4 Guinness nodes?

  5. 5 kevinclosson December 18, 2006 at 5:33 am

    Alexander,

    I cannot argue with you. The largest, most capable systems of every era have always met their match with the worst apps of their day…ask me how I know some time 🙂

    Thanks for stopping by.

    P.S., Don’t tell anyone, but there is a chance that an IBM System p p595 might be able to endure the torture of that bad application at least a little longer–but only a chance 🙂 Hey, I have my favorites!

  6. 7 Christo Kutrovsky December 18, 2006 at 6:32 pm

    Kevin,

    In my experience so far, every time there is a RAC deployment requirement, it’s for redundancy, not for performance. And the reason RAC is preferred is because of zero-effort failover. The added benefit of “using up that extra hardware we have” just comes in handy.

    I totally agree with you for the performance points. Unless you are maxing out the CPU of the largest SMP system you can afford, there is no reason for you to move to RAC for performance reasons.

    I think most times people mentally associate a server with its storage. And when they think “2 servers” they think twice the capacity, instead of twice the CPU capacity.

    The way RAC nodes should be looked at is as an extension to the number of CPUs you can have in your “processing unit”, with an extra cost associated.

    Of course, you get some extra RAM, but that’s not really an issue (I think?).

    When I was writing my script, it was more meant to be used with home-grown applications to determine which objects have the biggest performance impact in a RAC environment, and thus to guide work on improving those areas.

  7. 8 kevinclosson December 18, 2006 at 6:38 pm

    Christo,

    I agree with you. I hope it didn’t look as though I was taking a swipe at your blog post. It was just the catalyst for what I wanted to say…

  8. 9 Amir Hameed December 19, 2006 at 3:10 pm

    If you are designing a multi-node RAC system and one of the requirements is that, outside of the time window when an instance is crashed, the system must not run in degraded mode, then you need to take into consideration whether the remaining nodes have enough horsepower to sustain the load. For example, in a two-node RAC design, if each node is running at 50% capacity then I do not see anything wrong with it, because in the event of an instance crash or scheduled node maintenance the surviving node has to carry the load that the other node was carrying and will most likely run at 80-85% capacity. Again, this depends on your LOS and the criticality of the application.

  9. 10 kevinclosson December 19, 2006 at 4:09 pm

    Yes, Amir, you are right. Like I say, you can’t fit 20 lbs of rocks into a 10 lb bag. On one hand, this fact is an argument in favor of deploying on a larger number of smaller nodes rather than a few (or, more likely, 2) nodes. It is easier to deal with the failover load of, say, 25% should a node in a 4-node cluster take a hit. On the other hand, large node count RAC is difficult to manage in an environment where the only cluster-ware software is the database itself. And then, of course, there is the question of RAC scalability. We have a lot of customers running RAC on more than 4 nodes though.
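    For the arithmetic-minded, the naive headroom math is just (a rough model only; it ignores the GCS/GES overhead that goes away when the work lands on fewer instances, which is the point I was making above):

    \[
    u_{\mathrm{survivors}} \approx u \times \frac{n}{n-1}
    \]

    Four nodes at, say, 60% become three at roughly 80%, while two nodes at 50% would, on paper, drive the lone survivor to 100%; in practice the consolidated workload sheds some of its Cache Fusion overhead, which is one reason Amir’s 80-85% estimate is plausible.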

  10. 11 Noons December 19, 2006 at 9:44 pm

    Alex:
    “(except data warehouses but hey what does HA for a data warehouse mean?)”

    dude, you haven’t met my DW clients yet, have you? They absolutely must not lose ANY of their big ETL data loads, and they don’t have the space to keep those extract files waiting for a failed DW. Makes for an interesting life when one finds bugs in the DMT->LMT conversion process, believe me!…
    Ah well: there is always an exception to any rule! 😉

    Great bunch of replies, everyone. Thanks a lot. Sometimes I wonder if it wouldn’t make sense to use the “federated db” design approach to partitioning applications for RAC?

  11. 12 Prabhakaran April 2, 2007 at 6:00 pm

    Have any of you tried to run the new Quest Benchmark Factory 5.0 with the Clustering Option? We are running a TPC-C workload on 2 nodes and we are seeing poor or no scaling. Calling the vendor to describe the app is of no help at all. They just put our calls on ignore or send our email to the Recycle Bin.

    What I suspect is that it is sending queries hitting the same rows to both nodes of our Test System RAC.

    But if any of you have had any experience with BMF, that would be great to know.

    Regards,
    Prabhakaran.

  12. 13 kevinclosson April 3, 2007 at 9:10 pm

    Prabhakaran,

    Yes, I know BMF. RAC will have a very difficult time scaling TPC-C as it is implemented in BMF. You’ll have to use a transaction monitor to route requests based on warehouse. That is how all clustered Oracle TPC-Cs are done.

    TPC-C doesn’t look like any real-world workload, really, so don’t think RAC is broken if it can’t scale that workload.

  13. 14 prabhakaran April 15, 2007 at 2:45 pm

    Hi Kevin,

    Thanks very much for your response. Based on how BMF is marketed and documented, it looked like it could scale right out of the box. We were always under the impression that we were doing something wrong.

    We escalated to the Quest folks (with a couple of threats to dump the product), and they asked us to create reverse indexes on c_order and c_order_line. They also asked us to create a couple of extra indexes, which they are yet to send us.

    Do you know how to combine the TP monitor with BMF in order to get the scaling?

    Would you be able to email me at or can I email you directly for more information?

    Thanks a lot.
    Prabhakaran.

  14. 15 Bernd Eckenfels November 17, 2007 at 11:47 am

    Looking for some experiences with RAC for HA. How long, in fact, is the brown-out time, and does the additional complexity on average make a multi-node cluster less reliable than a failover SMP?

  15. 16 Polarski Bernard March 14, 2008 at 2:16 pm

    I’ve got a two-node RAC OLTP system. The customer has an application with a moderate load. Customer wants the most HA at the lowest cost possible. So here we are with a 2-node Linux RAC, Standard Edition on ASM and ASSM (low cost also means low management).

    On this they put 2 application servers per node, with all 4 application servers doing the same inserts on the same blocks. ASSM takes a bit of the edge off, and you can wave bye-bye to range scans, but at the end of the day I still get some hot blocks. What can I do? Spare me the reverse key index, please – I’ve got LOBs, and the root header of the LOB index associated with the LOBs is pounded hard.

    It’s cheap HA but it scales very badly, and ‘gc buffer busy’ waits rule as master. I keep turning the system around, reading all the literature in quest of an idea, but up to now I must say that the match score is: RAC/HA 1 – DBA 0.

  16. 17 kevinclosson March 14, 2008 at 4:24 pm

    Polarski Bernard,

    “…with all 4 application servers doing the same inserts on the same blocks.”

    …If you had said updates of the same block, I’d understand the dilemma. I shouldn’t think inserts would pile up with ASSM. How did you determine that the inserts are all hammering the same blocks?

  17. 18 Alex Gorbachev March 22, 2008 at 1:10 am

    Customer wants the most HA at the lowest cost possible. So here we are with a 2-node Linux RAC

    I’m a bit late on this, but practical experience and all logic suggest that the cheapest HA solution for Oracle is a *single* instance. Adding a physical standby to the picture improves availability a lot. Though, you won’t be able to run managed standby with SE, of course.

    There is no way to have cheap HA with RAC, but people treat the term HA differently, so you can say it depends.

  18. 19 Craig Glendenning March 2, 2009 at 5:49 am

    Just my 2.5 cents on Kevin’s statement:

    “There are too many DBAs out there that run scripts way down-wind of the application and if they see such metrics as high hit ratios in cache, or other such metrics they rest on their laurels. That is bad mojo.”

    Yes. We ought to focus on collecting properly scoped diagnostic data to achieve specific performance goals. Cary Millsap’s book should never stray far from our desks. Viva la “Method R”!

    Craig

