OK, that was a come-on title. I’ll admit it straight away. You might find this post interesting nonetheless. Some time back, Christo Kutrovsky made a blog entry on the Pythian site about buffer cache analysis for RAC. I meant to blog about the post, but never got around to it—until today.
Christo’s entry consisted of some RAC theory and a buffer cache contents SQL query. I admit I have not yet tested his script against any of my RAC databases. I intend to do so soon, but I can’t right now because they are all under test. However, I wanted to offer a few thoughts on Christo’s take on RAC theory. But first, a word about one statement in his post. He wrote:
There’s a caveat however. You have to first put your application in RAC, then the query can tell you how well it runs.
Not that Christo is saying so, but please don’t get into the habit of using scripts against internal performance tables as a metric of how “well” things are running. Such scripts should be used as tools to approach a known performance problem, a problem measured much closer to the user of the application. There are too many DBAs out there who run scripts way downwind of the application and, if they see metrics such as high cache hit ratios, rest on their laurels. That is bad mojo. It is not at all unlikely that even a script like Christo’s could give a very “bad” reading while application performance is satisfactory, and vice versa. OK, enough said.
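For the curious, a query in the spirit of what Christo describes (this is just a sketch, not his script) can be built on GV$BH joined to DBA_OBJECTS to see how many blocks of each object are cached on each instance. Column and status names below assume a typical 10g-era dictionary; check your release.

-- Sketch only: cached blocks per object, per instance, from GV$BH.
-- Join on the data object id; exclude free buffers.
SELECT o.owner,
       o.object_name,
       bh.inst_id,
       COUNT(*) AS cached_blocks
FROM   gv$bh bh,
       dba_objects o
WHERE  bh.objd = o.data_object_id
AND    bh.status != 'free'
GROUP  BY o.owner, o.object_name, bh.inst_id
ORDER  BY cached_blocks DESC;

A result showing the same object heavily cached on several instances is a hint (only a hint) that those instances are all working the same data.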
Application Partitioning with RAC
The basic premise Christo was trying to get across is that RAC works best when applications accessing the instances are partitioned in such a way as to not require cross-instance data shipping. Of course that is true, but what lengths do you really have to go to in order to get your money’s worth out of RAC? That is, we all recall how horrible block pings were with OPS. Or do we? See, most people who loathed the dreaded block ping in OPS thought the poison was in the disk I/O component of a ping, when in reality the poison was in the IPC (both inter- and intra-instance IPC). OK, what am I talking about? It was quite common for a block ping in OPS to take on the order of 200-250 milliseconds on a system where disk I/O was being serviced at respectable times like 10ms. Where did the time go? IPC.
Remembering the Ping
In OPS, when a shadow process needed a block from another instance, there was an astounding amount of IPC involved in getting the block from one instance to the other. In quick and dirty terms (this is just a brief overview of the life of a block ping), it went like this: the shadow process requested that the local LCK process communicate with the remote LCK process, which in turn communicated with the DBWR process on that node. That DBWR process then flushed the required block (along with all the modified blocks covered by the same PCM lock) to disk. That DBWR then posted its local LCK, which in turn posted the LCK process back on the node where the original requesting shadow process was waiting. That LCK then posted the shadow process, and the shadow process finally read the block from disk. Whew. Note that at every IPC point the act of messaging only makes the posted process runnable; it then waits in line for CPU in accordance with its mode and priority. Also, when DBWR was posted on the holding node, it was unlikely to be idle, so the life of the block ping also included time spent while DBWR finished the SGA flushing it was already doing when it got posted. All told, there were quite often some 20 points at which the processes involved were sitting in runnable states. Considering the scheduling time quantum is/was 10ms, you routinely got as much as 200ms of overhead on a block ping that was nothing but scheduling delay. What a drag.
What Does This Have To Do With RAC?
Christo’s post discusses divide-and-conquer style RAC partitioning, and he is right. If you want RAC to perform perfectly for you, you have to make sure that RAC isn’t being used. Oh, he’s gone off the deep end again, you say. No, not really. What I’m saying is that if you completely partition your workload, then RAC is indeed not really being used. I’m not saying Christo is suggesting you have to do that. I am saying, however, that you don’t have to do that. This blog post is not just a shill for Cache Fusion, but folks, we are not talking about block pings here. Cache Fusion, even over Gigabit Ethernet, is actually quite efficient. Applications can scale fairly well with RAC without going to extreme partitioning efforts. I think the best message is that application partitioning should be looked at as a method of exploiting this exorbitantly priced stuff you bought. That is, in the same way we try to exploit the efficiencies gained by fundamental SMP cache-affinity principles, so should attempts be made to localize demand for tables and indexes (and other objects) to instances, when feasible. If it is not feasible to do any application partitioning, and RAC isn’t scaling for you, you have to get a bigger SMP. Sorry. How often do I see that? Strangely, not that often. Why?
I can’t count how often I see production RAC instances running throughout an entire RAC cluster at processor utilization levels well below 50%. And I’m talking about RAC deployments where no attempt has been made to partition the application. These sites often don’t need to consider such deployment tactics because the performance they are getting meets their requirements. I do cringe and bite my tongue, however, when I see two instances of RAC in a two-node cluster, void of any application partitioning, running at, say, 40% processor utilization on each node. If no partitioning effort has been made, that means there is Cache Fusion (GCS/GES) traffic in play, and lots of it. Deployments like that are turning their GbE Cache Fusion interconnect into an extension of the system bus, if you will. If I were the administrator of such a setup, I’d ask Santa to scramble down the chimney and pack that entire workload into one server at roughly 80% utilization. But that’s just me. Oh, actually, packing two 40% RAC workloads back into a single server doesn’t necessarily produce 80% utilization. There is more to it than that. I’ll see if I can blog about that one too at some point.
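If you want a quick sense of how much Cache Fusion traffic a deployment like that is generating, the global cache statistics will tell you. A sketch, assuming 10g-style statistic names in GV$SYSSTAT (on 9i the equivalents are prefixed “global cache” rather than “gc”):

-- Sketch only: cumulative Cache Fusion block transfers per instance since startup.
SELECT inst_id,
       name,
       value
FROM   gv$sysstat
WHERE  name IN ('gc cr blocks received',
                'gc current blocks received',
                'gc cr blocks served',
                'gc current blocks served')
ORDER  BY inst_id, name;

For a per-object view, which is handy if you do decide to localize demand for particular tables and indexes to particular instances, GV$SEGMENT_STATISTICS exposes the same sort of global cache statistics per segment.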
What about High-Speed, Low-Latency Interconnects?
With OLTP, if the processors are saturated on the RAC instances you are trying to scale, a high-speed, low-latency interconnect will not buy you a thing. Sorry. I’ll blog about why in another post.
If you are one of the few out there who find themselves facing a total partitioning exercise with RAC, why not deploy a larger SMP instead? Comments?