We are all at different levels of skill and technical sophistication, so this post will look like fluff to some readers. This is just a quick post to give a clear depiction of a SAN configuration that wouldn't do anyone any favors in an Oracle environment.
I showed some snapshots of the lab gear at my disposal in this blog entry. There are a lot of SANs in the lab here, but I can't say which brand of SAN array I'm blogging about today, and honestly, this problem is not specific to the architecture of the particular SAN array controller in question. Yes, this is a low-end array, but the concepts I'm blogging about are relevant all the way up to the high end. The following vmstat(1) output shows one of the problems I'm blogging about:
# vmstat 10
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd     free   buff  cache   si   so    bi    bo    in    cs us sy id wa
 3  3      0 31577944  81948 653352    0    0    96   597    13    34  0  7 89  4
 5  4      0 31550968  82204 653096    0    0 73732 89913  2919 57992  0 10 60 30
 0  4      0 31576760  82332 652968    0    0 74555 98056  2942 55904  0 10 60 30
 1  6      0 31553824  82460 652840    0    0 80289 85870  2831 61435  0 10 63 28
 0  4      0 31576968  82460 652324    0    0 79466 85397  2860 60537  0 10 62 28
 8  0      0 31573368  82588 654260    0    0 41279 46733  1917 36035  0 32 53 15
 8  0      0 31572896  82588 653744    0    0 11613 13344  1239 13387  0 49 46  5
 5  0      0 31572528  82588 654260    0    0  8875  9039  1193 10727  0 47 49  4
 7  0      0 31572280  82588 653744    0    0  6559  7057  1146  9823  0 47 49  4
 6  0      0 31572176  82588 653744    0    0  6556  7048  1156  9121  0 50 47  3
 4  0      0 31572072  82588 654260    0    0  5736  5956  1136  7955  0 47 50  3
 6  0      0 31571576  82588 654260    0    0  4205  5529  1126  7851  0 47 51  2
 8  0      0 31571584  82588 654260    0    0  5960  5264  1132  7982  0 48 49  3
 7  0      0 31570896  82652 654712    0    0  4239  5167  1119  8086  0 47 51  2
11  0      0 31569968  82652 654712    0    0  5116  4995  1124  7437  0 55 43  2
 7  0      0 31570184  82684 653648    0    0  4782  5082  1125  7880  0 54 44  2
This vmstat(1) output shows that Oracle was getting upwards of 165MB/s of combined read/write throughput (via a single 2Gb FCP path) until the array cache reached its saturation point. At that point, the throughput of the array fell back to the bandwidth of the spindles being accessed.
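For readers who want to see where that figure comes from, here is a minimal sketch (Python) of the arithmetic. It assumes vmstat is reporting bi/bo in KiB per second (the Linux default) and uses the peak rows from the output above:

# Rough arithmetic behind the ~165MB/s figure, using the peak vmstat rows above.
# Assumption: vmstat reports bi (blocks in) and bo (blocks out) in KiB per second.

peak_rows = [
    (73732, 89913),
    (74555, 98056),
    (80289, 85870),
    (79466, 85397),
]

for bi, bo in peak_rows:
    read_mb  = bi / 1024
    write_mb = bo / 1024
    print(f"read {read_mb:5.1f} MB/s + write {write_mb:5.1f} MB/s = {read_mb + write_mb:5.1f} MB/s combined")

# Each peak interval lands in the ~160-170 MB/s range of combined throughput --
# in line with the "upwards of 165MB/s" cited above and close to the practical
# limit of a single 2Gb FCP path -- until the array cache saturates and the
# bi/bo columns collapse to spindle speed.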
It's All About Balance
I have a pretty simple view of storage and it is worth quoting myself:
Physical I/O is a necessary evil. Bottlenecking downwind of the SAN array cache is silly. Bottlenecking above the disks is foolishness.
Bottlenecking The Array (Silliness)
Configuring LUNs with insufficient spindle count to handle the cache-miss (read-through) and write-back overhead is what I call bottlenecking the SAN array. This sort of bottleneck is strictly a configuration issue and therefore falls into the silliness category. The vmstat(1) output above is an example of bottlenecking the array. That is, the array could certainly deliver more throughput, but the spindle count was holding it back. Again, I'm not going to say which vendor's SAN array controller it was because, as it turns out, any of them will act this way under these circumstances. I had a LUN of some 500GB that consisted of very few spindles, so it was no surprise to me that the throughput was lousy. This LUN had to be configured this way due to other constraints on the array's usage (read: Kevin had to share some hardware). I got to live the pain I blogged about recently in this post about capacity versus spindle count. I could configure a more reasonable number of drives and get around this performance problem, but only up to the point where the tables turn and the array starts to bottleneck the disks.
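To put some rough numbers on that, here is a back-of-the-envelope sketch (Python). The per-spindle rate is purely an assumed planning figure for a mixed read/write, largely random workload, not a measurement from the array in question:

import math

# Back-of-the-envelope spindle sizing. per_disk_mb_s is an assumption (a
# conservative planning figure for mixed, mostly random I/O on a 10K/15K RPM
# drive); substitute measured numbers for real sizing work.

def spindles_needed(target_mb_s, per_disk_mb_s=30.0):
    """Minimum spindles behind the LUN to sustain target_mb_s once the
    workload is cache-miss reads plus write-back flushes."""
    return math.ceil(target_mb_s / per_disk_mb_s)

# The host was pushing roughly 165 MB/s before the cache filled, so a LUN
# carved from only a handful of drives simply cannot keep that up from disk:
print(spindles_needed(165, per_disk_mb_s=30.0))   # -> 6
print(spindles_needed(165, per_disk_mb_s=15.0))   # -> 11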
Bottlenecking The Disks (Foolishness)
What do I mean by this? Well, most modern SAN arrays will bottleneck long before your application realizes the full bandwidth potential of all the drives the array can support. I call this bottlenecking the disks, and it falls into the foolishness category. This is basic foolishness on the SAN array vendor's part. Why build storage arrays with a horrible imbalance between capacity and performance?
Let me go over a typical case of a SAN array that bottlenecks the disks. Here is a link to some technical detail of a high-end array. This particular array supports roughly 140TB of capacity when configured with the maximum spindle count (1024) of 146GB drives. On page 3 of the document, the vendor gives its first bandwidth figure by citing an internal bandwidth of 15GB/s. What does that mean to the servers connected to this array? Reading further (pg. 15), the document states that there is some 10.6GB/s of aggregate data bandwidth and 5.3GB/s of bandwidth for control. This means that if you get all the king's horses and all the king's men to work it out, you could feed 10.6GB/s of data to your servers. Indeed, 10.6GB/s is a great deal of bandwidth. So what am I talking about? Taking another look at page 15, we see the vendor's claim that a single 146GB drive can deliver 99.9MB/s (max theoretical). If you wanted to drive all these spindles at full throttle, they would theoretically deliver 99.9GB/s (1024 disks x 99.9MB/s), which is far more than the maximum theoretical data bandwidth of the array. In fact, if you drove "only" about 108 of these spindles at full throttle you'd saturate the array. I put quotes around the word only because 108 is a lot of disks, but it is only about 10% of what the array supports.
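The arithmetic behind those claims is easy to sanity-check. Here is a quick sketch (Python) using only the figures quoted above: 10.6GB/s aggregate data bandwidth, 99.9MB/s theoretical maximum per drive, and 1024 drives:

# Sanity-checking the capacity-versus-bandwidth imbalance using the vendor's
# own published figures quoted above.

max_drives       = 1024
per_drive_mb_s   = 99.9     # max theoretical rate per 146GB drive
array_data_gb_s  = 10.6     # aggregate data bandwidth of the array

aggregate_disk_gb_s = max_drives * per_drive_mb_s / 1024       # ~99.9 GB/s
drives_to_saturate  = array_data_gb_s * 1024 / per_drive_mb_s  # ~108.7 -- the "about 108" in the text

print(f"all {max_drives} spindles flat out : {aggregate_disk_gb_s:6.1f} GB/s")
print(f"array data bandwidth       : {array_data_gb_s:6.1f} GB/s")
print(f"drives to saturate array   : {drives_to_saturate:6.1f} "
      f"({drives_to_saturate / max_drives:.1%} of the maximum spindle count)")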
If Oracle needs to do a physical I/O, let's not bottleneck somewhere in the pipe! Think about it this way: in the technology stack I'm discussing (Oracle, high-end SAN array, etc.), hard drive technology represents the simplest and least advanced component. That is, while hard drives are faster than they were 10 years ago, they have not fundamentally changed. They are still round and brown and they spin. Wouldn't you think homely old disk drives would be the bane of performance? They aren't. If we could drive all our disks at their maximum throughput, we'd be in a much better place performance-wise.
Summary
Hard drives are miserably low-tech necessary evils. Will we ever get a storage architecture where the more sophisticated components don’t make matters worse? Yes, we will. I’ll tell you all more as time passes. In the meantime, I bet dollars to donuts that the paltry 128 drives used in the TPC-H benchmark I blogged about were being driven at or near full bandwidth. That, my dear readers, is cool.
I think the idea is that if you are doing any amount of random I/O, the per disk throughput drops enormously, and you will not bottleneck the disks in a typical configuration.
Even if it's sequential, you will be driving many, many servers at the data rates you are talking about. Maybe there is an element of randomness introduced by 'context switching' on the disks with that many streams, if that makes any sense?
Hello Kevin, really interesting article.
Just a question: I work on a 30TB DW and our DMX system delivers a 30MB/s rate for a single full scan (striped LUNs, static AIX striping, multiple datafiles on multiple CIO filesystems)… Could you give me some advice on possible bottlenecks…? (Sorry, I'm French…)
Best Regards
Sorry, I forgot: the DMX disks are 10,000 RPM ones.
Dba_z,
The type of disk doesn’t matter. You can get 30MB/s from a single process on a laptop reading a SATA 5400 RPM drive. Either a) 30 MB/s is a typo, or b) you have something really broken or c) you’re not using parallel query or d) you’re not using Oracle at all or e) this is a prank and I’m gullible 😦
Thanks Kevin for your answer. More details: I'm using Oracle 10g, of course, on a data warehouse (30TB). No parallel query because of the lack of CPU power. I posted this message because I'm not a SAN specialist, and it's not under my responsibility at my firm, but I tried a RAID 0 on my PC at home and got 610MB/s (3 Crucial C300 SSDs). So it is a problem for me when I see an EMC device that costs much more and serves only a 30MB/s full table scan, even when there is only one query on the database… For information, the hardware is a 4-core POWER6 at 4.2GHz.
POWER6 not enough CPU power for parallel query? Wow. You really should enable it. There is no reason not to. Your company is not getting its money's worth, but it is certainly not the SAN's fault. A single Oracle foreground process without parallel query is going to crawl along badly. With Oracle Database 10g EE, I think the list-price money to Larry is about a quarter of a million dollars. Please do lobby the persons in charge there to use the software they've paid for.
I wish you were right, Kevin, but with only 4 cores (dual-threaded, of course), Cognos, SAS, and ad hoc analytic queries, the resources are consumed very quickly… I work at an insurance company with tens of users doing full table scans all the time, so 20 sessions doing hard work is very usual… I understand from your post that there is no way to optimize a full table scan without parallel query… I'm disappointed!! 😦 But thank you for the time you spend, Kevin. I'm going to investigate parallel query to see if there's a way to minimize its impact on my DW CPU resources.
Best regards
Dba_z,
But 30MB/s isn't giving the database a fair share of anything. I'm not suggesting giving PQO free rein of the box. Even if you set parallel_max_servers and parallel_min_servers to 1 you'd get better disk I/O. Aren't those ad-hoc queries just waiting for physical I/O all the time? Are any SLAs being met for any of these apps on that little box?
We now face the mixed-workload problem that Gartner talks about in its DW Magic Quadrants. On my DW, customers request full table scans to:
– send data to SAS for data mining
– control the data (data quality processes)
Another kind of workload is, of course, data integration: lots of full scans in the operational data store and the data staging area for transformation.
So there's a lack of SLAs, of course, but all the documents I've read mention that customers are not satisfied about this in DW environments: "the DW is always too slow"…
So the SLA guarantee is provided by a copy of the critical data mart, with star transformation, etc…
Our goal at this time is to divide the weekly data integration process done over the weekend.
By "divide" I meant divide the time. Sorry, I couldn't make corrections or add details to my post because of an unwanted "publish" on my smartphone… To answer your question about waits: of course, the waits are mostly for I/O. When there are CPU waits, they are commonly due to bad SQL, so we teach them (hard work!! 😉 ).
So you are short on CPU and want to fix an I/O bottleneck? You might not get any improved throughput if you do so.
Sorry, my smartphone screen is too small (or my fingers too fat, lol).
Just a detail: the data mart copy for the SLA is done on another hardware platform.
At this time, all Informatica workflows have already been optimized, so we must find another way to speed up those workloads…
Hope I haven't been too long…
Regards! Hope you have a good time in Las Vegas (maybe a magic trick to take full advantage of two FC paths (4Gb each) with a few full table scans, like I do with my RAID 0 on my damned PC)…
Thanx