I’ve never met Mike Ault, but some friends of mine, who are fellow OakTable Network members, say he’s a great guy and I believe them. Mike works at Texas Memory Systems and I know some of those guys as well (Hi Woody, Jamon). Pleasantries aside, I have to call out some of the content Mike posted on a recent blog entry about HP Oracle Database Machine and Exadata Storage Server. Just because I blog about someone else’s posted information doesn’t mean I’m “out to get them.” Mike’s post made it clear I need to address a few things. Mike’s post was not vicious anti-Technical Marketing by any means, but it was riddled with inaccuracies that deserve correction.
Errata
While the first two errata I point out may seem trivial to many readers, I think accuracy is important if one intends to contrast one technology offering against another. After all, you won’t find me posting blog entries about Texas Memory Systems SSD being based on core memory or green Jell-O.
The first error I need to point out is that Mike refers to Oracle Exadata Storage Server cells as “block” or “blocks” 8 times in his post. They are cells, not blocks.
The second error I need to point out is rooted in the following quote from Mike’s blog entry:
These new storage and database devices offer up to 168 terabytes of raw storage with 368 gigabytes of caching and 64 main CPUs
That is partially true. With the SATA option, the HP Oracle Database Machine does offer 168TB of gross disk capacity. The error is in the “368 gigabytes of caching” bit. The HP Oracle Database Machine does indeed come with 8 Real Application Clusters hosts in the Database grid configured with 32GB RAM and 14 Exadata Storage Server cells with 8GB each. However, it is entirely erroneous to suggest that the entirety of physical memory across both the Database grid and Storage grid somehow works in unison as “cache.” It’s not that the gross 368GB (8×32 + 14×8) isn’t usable as cache. It’s more the fact that none of it is used as user-data cache–at least not cache that somehow helps out with DW/BI workloads. The notion that it makes sense to put 368GB of cache in front of, say, a 10TB table scan, and somehow boost DW/BI query performance, is the madness that Exadata aims to put to rest. Here’s a rule:
If you can’t cache the entirety of a dataset you are scanning, don’t cache at all.
– Kevin Closson
Cache, Gas or a Full Glass. Nobody Rides for Free.
No, we don’t use the 8x32GB physical memory in the Database grid as cache because cycling, say, the results of a 2TB table scan through 368GB aggregate cache would do nothing but impede performance. Caching costs, and if there are no cache hits there is no benefit. Anyone who claims to know Oracle would know that parallel query table scans do not pollute the shared cache of Oracle Database. A more imaginative, and correct, use for the 32GB RAM in each of the hosts in the Database grid would be for sorting, joins (hash, etc) and other such uses. Of course you don’t get the entire 32GB anyway as there is an OS and other overhead on the server. But what about the 8GB RAM on each Oracle Exadata Storage cell?
One of the main value propositions of Oracle Exadata Storage Server is the fact that lower-half query functionality has been offloaded to the cells (e.g., filtering, column projection, etc). Now, consider the fact that we can scan disks in the SAS-based Exadata Storage Server at the rate of 1GB/s. We attack the drives with 1MB physical reads and buffer the read results in a shared cache visible to all threads in the Oracle Exadata Storage Server software. To achieve 1GB/s with 1MB I/O requests requires 1000 physical I/Os per second. OK, now I’m sure all the fully-cached-conventional-array guys are going to point out that 1000 IOPS isn’t worth talking about, and I’d agree. Forget for the moment that 1GB/s is in fact very close to the limit of data transfer many mid-range storage arrays have to offer. No, I’m not trying to get you excited about the 1GB/s because if that isn’t enough you can add more. What I’m pointing out is the fact that the results of 1000 IOPS (each 1MB in size) must be buffered somewhere while the worker threads rip through the data blocks applying filtration and plucking out cited columns. That’s 125 1MB filtration and projection operations per second per processor core. There is a lot going on and we need ample buffering space to do the offload processing.
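To make that arithmetic explicit, here is a trivial back-of-the-envelope sketch. It uses only figures already cited in this post (roughly 1GB/s per SAS cell, 1MB reads, and the 8 cores per cell implied by 112 storage-grid cores across 14 cells); it is illustration, not a sizing tool.

```python
# Back-of-the-envelope arithmetic for one SAS-based Exadata cell using only
# figures cited in this post: ~1 GB/s scan rate, 1 MB physical reads, and
# 8 cores per cell (112 storage-grid cores / 14 cells).
scan_rate_mb_per_sec = 1000   # ~1 GB/s
io_size_mb = 1                # 1 MB physical reads
cores_per_cell = 8            # 112 / 14

iops = scan_rate_mb_per_sec / io_size_mb       # ~1,000 physical I/Os per second per cell
ops_per_core = iops / cores_per_cell           # ~125 filter/project operations per core per second

print(f"{iops:.0f} IOPS per cell; ~{ops_per_core:.0f} 1MB filtration/projection operations per core per second")
```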
Mike then moved on to make the following statement:
The Oracle Database Machine was actually designed for large data warehouses but Larry assured us we could use it for OLTP applications as well. Performance improvements of 10X to 50X if you move your application to the Database Machine are promised.
I’m not going to write guarantees, but no matter, that statement only led in to the following:
This dramatic improvement over existing data warehouse systems is provided through placing an Oracle provided parallel processing engine on each Exadata building block so instead of passing data blocks, results are returned. How the latency of the drives is being defeated wasn’t fully explained.
Exadata Storage Server Software == Oracle Parallel Query
Folks, the Storage Server software running in the Exadata Storage Server cell is indeed parallel software and threaded, however, it is not entirely correct to state that there is a “parallel processing engine” that returns “results” from Exadata cells. More correctly, we offload scans (a.k.a. Smart Scan) to Exadata cells. Smart Scan technology embodies I/O, filtration, column projection and rudimentary joins. Insinuating otherwise makes Exadata out to be more of a database engine than intelligent storage and there is more than a subtle difference between the two concepts. So, no, “results” aren’t returned, filtered rows and projected columns are. That is not a nit-pick.
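If the distinction seems subtle, here is a toy sketch of what I mean. This is not Oracle code and the row data, predicate and column names are invented; it only illustrates the difference between shipping raw rows/blocks to the Database grid and a Smart Scan style offload that returns filtered rows and projected columns.

```python
# Toy illustration only (not Oracle code): the difference between shipping raw
# rows/blocks to the database grid and a Smart Scan style offload that applies
# the predicate and projects only the cited columns inside the storage cell.

rows_in_cell = [
    {"order_id": 1, "region": "EAST", "amount": 10.0,  "padding": "x" * 200},
    {"order_id": 2, "region": "WEST", "amount": 900.0, "padding": "y" * 200},
    {"order_id": 3, "region": "EAST", "amount": 450.0, "padding": "z" * 200},
]

def conventional_scan():
    # Conventional storage: every row (every block) travels to the host, which
    # then performs all filtration and projection itself.
    return rows_in_cell

def smart_scan(predicate, columns):
    # Offloaded scan: filtration and column projection happen in the cell;
    # only qualifying rows, trimmed to the cited columns, are returned.
    return [{c: row[c] for c in columns} for row in rows_in_cell if predicate(row)]

full_rows = conventional_scan()
offloaded = smart_scan(lambda r: r["amount"] > 100, ["order_id", "amount"])
print(len(full_rows), "full rows shipped conventionally vs", offloaded)
```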
DW/BI and I/O Latency
Mike finished that paragraph with the comment about how Oracle Exadata Storage Server “defeats” (or doesn’t) drive latency. I’ll simply point out that drive latency is not an issue with DW/BI workloads. The problem (addressed by Exadata) is the fact that attaching just a few modern hard drives to a conventional storage array leaves you with a throughput bottleneck. Exadata doesn’t do anything for drive latency because, shucks, the disks are still round, brown spinning thingies. Exadata does, however, make a balanced offering that doesn’t bottleneck the drives.
Mike continued with the following observation:
So in a full configuration you are on the tab for a 64 CPU Oracle and RAC license and 112 Oracle parallel query licenses
Yes, there are 64 processor cores in the Database grid component of the HP Oracle Database Machine, but Mike mentioning the 112 processor cores in the Exadata Storage Server grid is clearly indicative of the rampant misconception that Exadata Storage Server software is either some, most, or all of an Oracle Parallel Query instance. People who have not done their reading quickly jump to this conclusion and it is entirely false. So, mentioning the 112 Exadata Storage Server grid processors and “Oracle parallel query licenses” in the same breath is simple ignorance.
Mike continues with the following assertion:
Targeting the product to OLTP environments is just sloppy marketing as the system will not offer the latency needed in real OLTP transaction intensive shops.
While Larry Ellison and other important people have stated that Exadata fits in OLTP environments as well as DW/BI, I wouldn’t say it has been marketed that way, and certainly not sloppily. Until you folks see our specific OLTP numbers and value propositions I wouldn’t set out to craft any positioning pieces. Let me just say the following about OLTP.
OLTP Needs Huge Storage Cache, Right?
OLTP is I/O latency sensitive, but mostly for writes. Oracle offers a primary cache in the Oracle System Global Area disk buffer cache. Applications generally don’t miss SGA blocks and immediately re-read them at a rate that requires sub-millisecond service times. Hot blocks don’t age out of the cache. Oracle SGA cache misses generally access wildly random locations, or result in scanning disk. So, for storage cache to offer read benefit it must cover a reasonable amount of the wildly, randomly accessed blocks. The SGA and intelligent storage arrays share a common characteristic: the same access patterns that blow out the SGA cache also blow out storage array cache. After all, architecturally speaking, the storage array cache serves as a second-level cache behind the SGA. If it is the same size as the SGA it is pretty worthless. If it is, say, 10 times the size of the SGA but only 1/50th the size of the database it is also pretty useless, with the exception of those situations when people use storage array cache to make up for the fact that they are using, say, 1/10th the number of drives they actually need. Under-provisioning spindles is not good but that is an entirely different topic.
I know there are SAN array caches in the terabyte range and Mike speaks of multi-terabyte FLASH SSD disk farms. I suppose these are options-for a very select few.
Most Oracle OLTP deployments will do just fine running against non-bottlenecked storage with a reasonable amount of write-cache. Putting aside the idea of an entirely FLASH SSD deployment for a moment, the argument about storage cache helping OLTP boils down to what percentage of the SGA cache misses can be satisfied in the storage array cache and what overall performance increase that yields.
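To see why, model the storage array cache as a second-level cache that only ever sees SGA misses. The hit rates and service times in this sketch are invented for illustration; the shape of the result is the point.

```python
# Crude model (invented numbers) of a storage array cache sitting behind the
# SGA. The array cache only ever sees SGA misses, so unless it covers a
# meaningful share of the wildly random miss stream it barely moves the needle.
def physical_read_ms_per_logical_read(sga_hit, array_hit, array_ms=0.5, disk_ms=6.0):
    sga_miss = 1.0 - sga_hit
    return sga_miss * (array_hit * array_ms + (1.0 - array_hit) * disk_ms)

no_array_cache   = physical_read_ms_per_logical_read(sga_hit=0.99, array_hit=0.00)
tiny_array_cache = physical_read_ms_per_logical_read(sga_hit=0.99, array_hit=0.02)  # covers ~1/50th of the hot blocks
print(f"no array cache: {no_array_cache:.4f} ms; tiny array cache: {tiny_array_cache:.4f} ms per logical read")
```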
The Eye of a Needle
Recently, I was looking at the specification sheet for a freshly released mid-range Fibre Channel SAN storage array that supports up to 960 disk drives plumbed through a two-headed controller. The specification sheet shows a maximum of 16GB cache per storage processor (up to two of them). I should think the cache is mirrored to accommodate storage processor failure; maybe it isn’t, I don’t know. If it is mirrored, let’s pretend for a moment that mirroring storage processor cache is free even with modify-intensive workloads (subliminal man says it isn’t). Given this example, I have to ask who thinks 16GB of storage array cache in front of hundreds of drives offers any performance increase? It doesn’t, so let’s put to rest the OLTP storage cache benefit argument.
But Mike Wasn’t Talking About Storage Array Cache
Right, Mike wasn’t talking about storage array cache benefit in an OLTP environment, but he was talking about nosebleed IOP rates from FLASH SSD. When referring to Exadata, Mike stated (quote):
What might be an alternative? Well, how about keeping your existing hardware, keep your existing licenses, and just purchase solid state disks to supplement your existing technology stack? For that same amount of money you will shortly be able to get the same usable capacity of Texas Memory Systems RamSan devices. By my estimates that will give you 600,000 IOPS, 9 GB/sec bandwidth (using Fibre Channel, more with Infiniband), 48 terabytes of non-volatile flash storage, 384 GB of DDR cache and a speed up of 10-50X depending on the query (based on tests against the TPCH data set using disks and the equivalent RamSan SSD configuration).
OK, there is a lot to dissect in that paragraph. First there is the attractive sounding 600,000 IOPS with sub-millisecond response time. But wait, Mike suggests keeping your existing hardware. Folks, if you have existing hardware that is capable of driving OLTP I/O at the rate of 600,000 IOPS I want to shake your hand. Oracle OLTP doesn’t just issue I/O. It performs transactions that hammer the SGA cache and suffer some cache misses (logical to physical I/O ratio). The CPU cost wrapped around the physical I/O is not trivial. Indeed, the idea is to drive up CPU utilization and reduce physical I/O through schema design and proper SGA caching. Those of you who are current Oracle practitioners are invited to analyze your current production OLTP workload and assess the CPU utilization associated with your demonstrated physical I/O rate. If you have an OLTP workload that is doing more than, say, 5000 IOPS (physical) per processor core and you are not 100% processor-bound, tell us about it.
Yes, there are tricked out transactional benchmarks that shave off real-world features and code path and hammer out as much as 10,000 IOPS per processor core (on very powerful CPUs), but that is not your workload, or anyone else’s workload that reads this blog. So, if real OLTP saturates CPU at, say, 5000 IOPS I have to wonder what your “existing hardware” would look like if it were also able to take advantage of 600,000 IOPS. That would be a very formidable Database grid with something like 120 CPUs. Remember, Mike was talking about using existing hardware to take advantage of SSD instead of Exadata. If you have a 120 CPU Database grid, I suspect it is so critical that you wouldn’t be migrating it to anything. It is simply too important to mess with. I should hope. Oh, it’s actually more like about 2000 IOPS per processor core in real life anyway, but that doesn’t change the point much. And Exadata isn’t really about OLTP.
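For anyone who wants to check that arithmetic, here it is spelled out (the per-core IOPS figures are the rough numbers discussed above, not measurements):

```python
# How many database-grid cores does it take to actually consume 600,000 OLTP
# IOPS if a real workload saturates a core at ~5,000 physical IOPS (and more
# like ~2,000 in real life)?
storage_iops = 600_000

for iops_per_core in (5_000, 2_000):
    cores_needed = storage_iops / iops_per_core
    print(f"at {iops_per_core} IOPS per core: ~{cores_needed:.0f} cores needed")
```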
Let’s focus more intently on Mike’s supposition that an alternative to Exadata is “keeping your existing hardware” and feeding it 9GB/s from SSD. OK, first, that is 36% less I/O bandwidth than a single HP Oracle Database Machine can do, but let’s think about this for a moment. The Fibre Channel plumbing required for the Database grid to ingest 9GB/s is 23 active 4GFC FC HBAs at max theoretical throughput. That’s a lot of HBAs, and you need systems to plug them into. Remember, this is your “existing system.”
How much CPU does your “existing hardware” require to drive the 23 FC HBAs? Well, it takes a lot. Yes, I know you can use just a blip of CPU to mindlessly issue I/O in such a fashion as Orion or some pure I/O subsystem invigoration like dd if=/dev/sda of=/dev/null bs=1024k, but we are talking about DW/BI and Oracle. Oracle actually does stuff with the data returned from an I/O call. With non-Exadata storage, the CPU cost associated with I/O (e.g., issuing, reaping), filtration, projection, joining, sorting, aggregation, etc is paid by the Database grid. So your “existing system” has to be powerful enough to do the entirety of SQL processing at a rate of 9GB/s. Let’s pretend for a moment that there existed on the market a 4-socket server that could accommodate 23 FC HBAs. Does anyone think for a moment that the 4 processors (perhaps 8 or 16 cores) can actually do anything reasonable with 9GB/s I/O bandwidth? A general rule is to associate approximately 4 processor cores with each 4GFC HBA (purposefully ignoring trick benchmark configurations). I think it looks like “your existing system” has about 96 processor cores.
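Spelling out that plumbing arithmetic (assuming ~400MB/s theoretical per active 4GFC HBA and the rule-of-thumb 4 cores per HBA mentioned above):

```python
import math

# Plumbing arithmetic for ingesting 9 GB/s on "your existing system":
# ~400 MB/s theoretical per active 4GFC HBA, and roughly 4 processor cores
# of real SQL processing per HBA (the rule of thumb above).
target_mb_per_sec = 9000      # 9 GB/s
hba_mb_per_sec = 400          # max theoretical for one 4GFC HBA
cores_per_hba = 4

hbas_needed = math.ceil(target_mb_per_sec / hba_mb_per_sec)   # ~23 HBAs
cores_needed = hbas_needed * cores_per_hba                    # ~92-96 cores
print(f"~{hbas_needed} HBAs and ~{cores_needed} cores to process 9 GB/s without offload")
```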
A Chump Challenge
I’d put an HP Oracle Database Machine (64-core/14 cell) up against a 96-core/9GBPS FLASH SSD system any day of the week. I’d even give them 128 Database tier CPUs and not worry.
People keep forgetting that scans are offloaded to Exadata with the HP Oracle Database Machine. People shouldn’t craft their position pieces against Exadata by starting at the storage-regardless of the storage speeds and feeds.
It will always take more Database grid horsepower, in a non-Exadata environment, to drive the same scan rates offered by the HP Oracle Database Machine.
FLASH SSD
Did I mention that there is nothing (technically) preventing us from configuring Exadata Storage Server with 3.5″ FLASH SSD drives? Better late than never, but it isn’t really worth mentioning at this time.
>> So, if real OLTP saturates CPU at, say, 5000 IOPS I have to wonder what your “existing hardware” would look like if it were also able to take advantage of 600,000 IOPS.
I have to wonder also . . .
If I understand your argument correctly, you are suggesting that Oracle’s CPU-intensive internal machinations impose a limit on I/O throughput, such that, no matter how fast the storage unit, the CPUs become the bottleneck.
Using super-duper fast storage will overload the processors, so that, above some point, increased “true” Oracle I/O throughput becomes a function of both CPU and device latency . . .
This reminds me of the old T1 RAM argument, except it’s backwards! The server architects note that CPU speed outpaces RAM speed, and hence, the RAM must be co-located near the processors to keep the CPUs running efficiently.
*************************************************************
>> If you have an OLTP workload that is doing more than, say, 5000 IOPS (physical) per processor core and you are not 100% processor-bound, tell us about it.
I wonder what the “real” limit is?
It’s not uncommon to see databases shift bottlenecks from I/O-bound to CPU-bound after getting faster storage, but you are correct, CPU does indeed impose a strict limit on the upside benefits of super-fast storage. . . .
Hi Don,
It has always been the case that saturated CPUs cannot ingest any more data.
I don’t understand your tangent about T1 memory locality. CPUs stall for “distant” memory. Stalled CPUs are “busy”. If you reduce memory latency you get a better ratio of cycles to instruction (more work, less time). Making CPUs more efficient frees up cycles to demand more I/O, more I/O demands more CPU and ’round and ’round we go.
Back to the topic at hand, however, the point I’m making is that the CPUs will ultimately control the IOPs. If that number is 40,000 IOPS for your system, so be it. If the storage from whence you get the 40,000 is a 600,000 IOPS-capable storage configuration you have, um, a little headroom to spare.
Back to the point of Exadata in this regard, you have to understand that the choice is to either build out your Database grid ***AND*** your storage system to scale up, or build out Storage and complement it with a sufficient Database grid. The former being traditional Database grid attached to traditional storage and the latter being Exadata with scans offloaded to storage. I have queries that use less than a single Harpertown Xeon when performing a 4 table join with 6GB/s scan rate underpinnings. Offload is offload. All the “storage guys” seem to refuse to understand that.
Hey, I’m a “storage guy” and I get it. Hmmm…
“Did I mention that there is nothing (technically) preventing us from configuring Exadata Storage Server with 3.5″ FLASH SSD drives?”
Didn’t you already mention that it would not make sense to do so because “Exadata doesn’t really need Solid State Disk”?
Hi Kevin,
>> It has always been the case that saturated CPUs cannot ingest any more data.
Yes, and that’s a very important consideration!
******************************************************************
>> I don’t understand your tangent about T1 memory locality.
Sorry to be obtuse, but it is ironic. . . .
CPU speed always outpaces RAM speed, yet in Oracle, CPU speed constrains RAM speed.
That’s all, nothing profound . . .
******************************************************************
>> the point I’m making is that the CPUs will ultimately control the IOPs
Yes, and that’s the interesting part.
On the low end of the curve, (fast CPU with slow disk), faster storage translates directly into faster throughput. Conversely (slow CPU, fast storage), I/O is constrained by CPU.
It would be interesting to understand the details between these two extremes . . . .
“Didn’t you already mention that it would not make sense to do so because “Exadata doesn’t really need Solid State Disk”?”
..well, yes, and that is why they aren’t in there.
“It would be interesting to understand the details between these two extremes . . . .”
…I assure you, those details are really quite boring.
>> I assure you, those details are really quite boring.
OK, if you say so, but anytime CPU performance becomes functionally dependent on I/O throughput, it would be interesting to learn more . . .
“OLTP is I/O latency sensitive, but mostly for writes.” Well, not on our most active OLTP 7TB database. Even with an Oracle cache hit rate at 99.8%, individual transactions are I/O bound on random reads. Random reads are accounting for just short of 70% of transactional response time in the database, with a bit under 30% being in CPU execution. Moving to a newer array will possibly get us down to 50% of the time on random I/O, although moving to a new server will see us spending 75-80% of the time on random I/O, simply because disk latency isn’t improving notably whilst processors get faster. Writes barely figure in wait time as they are sub-ms latency due to array caching.
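To put rough numbers on that split (the halving assumptions below are purely illustrative, not measurements from our system):

```python
# Rough arithmetic for the response-time split above. Start from ~70% of
# transaction time in random reads and ~30% in CPU, then see where the split
# lands if only one side gets faster (the halving factors are illustrative).
io_units, cpu_units = 70.0, 30.0

newer_array_io = io_units / 2        # assume the newer array halves random-read time
newer_server_cpu = cpu_units / 2     # assume the newer server halves CPU time

print(f"newer array : {newer_array_io / (newer_array_io + cpu_units):.0%} of response time on random I/O")
print(f"newer server: {io_units / (io_units + newer_server_cpu):.0%} of response time on random I/O")
```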
Even if the disk latency is acceptable for online transactional users, it is very often the limiting factor on lots of batch jobs. The days when a batch job was a tramp through a few tables in a largely sequential manner are long gone – many COTS packages generate batch jobs that look much like OLTP transactions.
The only way to tackle this (apart from what can be done to improve the application) is a new technology which eliminates the latency due to spinning disks. The only game in town is SSDs. It’s not the number of IOPs (a few tens of thousands of those will kill pretty well any machine running a real application). It’s the latency.
In fact array cache can provide some major improvements. One is non-volatile write caching, critical on high-throughput OLTP transactions. If you use RAID-5 then substantial amounts are essential to allow for things like coalescing cached writes into full write stripes.
Also non-volatile array caching allows for things like logical snapshots to be taken with an imperceptible performance impact as maps can be held in memory. Far, far more efficient than anything that can be done at the server level.
It’s quite right that many, many GB of cache won’t help the random read problem – the SGA gets all the good stuff. But there are things that can be done with an Enterprise array or with SSDs that can’t be done on a host alone. For the great majority of our OLTP systems, it isn’t the number of IOPs that is the limiting factor, and it isn’t the latency on writes (cache deals with that); it’s the latency on random reads against data too large to fit in the SGA.
Steve,
Excellent, well thought out comment!
How did a Warehouse Machine get sidetracked into an OLTP discussion? This was merely a side-bar in my blog entry: that making it sound like it could just as easily be used for OLTP muddied the water. I apologize for any inaccuracies in terminology since there was little published about Exadata at the time I wrote the blog entry and I was running off of the single whitepaper that was available and, of course, Larry’s announcement.
TPC-H is a DSS benchmark and heavily dependent on disk IO speed as there is a direct correlation between results and number of disk spindles in most cases (yes, I realize a few of the newer results using specialized technology and “hyper” caching break this rule, just as the heavily federated Microsoft entry for TPC-C broke a few rules a few years back.)
As I recall Larry showed a slide giving a comparison between a test server that was disk based and one of the new Exadata cell based servers for TPC-H queries, however, I can’t seem to find it anywhere on the Oracle site. I would like to see that slide again with the details of both server configurations.
If you don’t believe disk latency is important then put your 300 Gigabyte database on a single 1 terabyte disk and be done with it. I don’t care how impressive an array of CPU you place in front of a limited IO bandwidth/throughput subsystem, your performance will suffer. As TMS RamSan and the Exadata offer more bandwidth, generally speaking, than is used by the systems they feed it seems a moot point.
As to whether results or blocks are passed back, I was just quoting Larry from his announcement, he said that the Exadata cells passed back results and not blocks several times. Excuse me for quoting him.
I quite agree that inefficient processing at the CPU or memory level will result in poor performance no matter how fast the underlying storage is made to be. Maybe Oracle should have addressed this instead of providing proprietary storage solutions.
As far as the number of licenses required, I was incorrect, I see now that the licensing is per disk, not CPU, so the number of required licenses is even higher than I suggested it might be, sorry for the inaccuracy.
Are you disagreeing that the users of Exadata will have to upgrade to 11.1.0.7, and throw away their existing hardware if they buy the full Warehouse Machine? We have seen just as dramatic an improvement as shown by the Warehouse Machine by properly configuring IO bandwidth and placing a set of RamSans in place, without changing anything else in the system; of course, should they also increase the number of CPUs and improve memory capacity, they would get added benefits.
What percentage of actual disk space do you suggest the users of the Exadata Cells limit themselves to? I see that you are using 12 terabytes to achieve less than 4 terabytes of available disk capacity, just how is that broken down?
Mike,
Thanks for stopping by. I was happy to put your comment through, but I can’t take time to respond. You commented with over 500 words without using typical point/counter-point format so I can’t figure out which portions of the original post any of your comments are referring to. Perhaps other readers can…
>> I can’t figure out which portions of the original post any of your comments are referring to. Perhaps other readers can…
Nope . . .
yes, I thought his comments were easy to follow.
Something that hasn’t been mentioned in this thread is that Exadata is Oracle-only. Just Oracle. $2m for just Oracle? I want choice and this ain’t it. I have never worked anywhere (from tiddlers to giants) that had dedicated servers, never mind dedicated storage.
Also, if a company wants super-fast storage for its DW, then I can see that Exadata may well be worth looking at (barring any qualms about zero support in future, should it prove to be a dead duck which nobody wanted). But, it’s a pretty niche product and has anyone thought that DWs are, by their very nature, often gigantic? That’s a pretty hefty task even to get one’s DW data into the new kit. And that’s even more cost.
Finally, as Steve Jones pointed out: OLTP is not always write-bound. Where I work it’s a web-based system, growing data-wise daily, with a touch of DW-esque stuff thrown in. No way is $2M going to be spent… at least until the business has grown to a size that could possibly warrant such a spend, and by then they’d be SSD-centric, anyway. And SSD will get faster, or superseded at some point, and hey! buy the new storage gizmo – not sell a Rembrandt to pay for Exadata.
My two pennies’ worth.
Hi Kevin,
I know this is an old post – was trying to get my head around the following statement:
.
“The CPU cost wrapped around the physical I/O is not trivial”
.
When you say this do you mean the resultant/associated CPU cost? I guess when I say resultant/associated I’m talking about once a database server (not speaking Exadata here) receives data – this data may require ordering, i.e. sorting, elimination of data via joins, etc – this is all CPU stuff.
.
Or are you saying that in order to perform a physical I/O the CPU is involved too? If so, could you elaborate on that? Are you talking about the CPU cost required to work/communicate with the HBAs?
.
Thanks
Also a thought – additionally, physical I/O may need to traverse through some sort of a cache and the Oracle buffer cache, which makes it logical I/O – which by its nature is a CPU-based operation? i.e. reading/writing from memory caches.
NM,
When I wrote about the CPU cost wrapped around the physical I/O in that post I was referring to how much CPU a workload burns outside the I/O path. The focus of that post was a hypothetical OLTP system that is generating 600,000 IOPS. Think of it this way. If Oracle services a transaction that does not require any datafile physical I/O it uses N processor cycles. These cycles fall in the many layers of the Oracle kernel like SQL layer, transaction layer, cache layer, etc. If, on the other hand, the same transaction wasn’t satisfied with SGA cache contents and had to LRU a buffer, travel the physical I/O code path (e.g., skgfr and the OSDs like pread64()) and then chain the fresh block into cache buffers chains, would the transaction cost the same N processor cycles or less or more? The answer is, of course, more.
The whole point of that post was that you can’t shave off PIO service times and get a magical boost in throughput. While Oracle administrators tend to find themselves wrestling with I/O the most, all database workloads require CPU. If you don’t have CPU to spare before breaking a disk bottleneck then set your expectations accordingly. If your system is approaching a CPU-critical state with latent physical I/O, you need to combine your PIO improvement with additional CPU. Bottlenecks are like shifting sand.
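A crude model of that point, with invented per-transaction CPU costs; only the shape matters:

```python
# Crude model (invented per-transaction CPU costs) of why shaving PIO service
# time alone cannot buy throughput: a transaction that misses the SGA pays
# extra CPU for the LRU work, the physical I/O code path and chaining the
# fresh block into the cache, so the CPU-bound ceiling moves down, not up.
cores = 8
cpu_us_per_core_per_sec = 1_000_000

txn_cpu_us_all_sga_hits = 400     # transaction satisfied entirely from the SGA
txn_cpu_us_with_pio = 550         # same transaction plus the PIO code path

for label, cost_us in (("all SGA hits", txn_cpu_us_all_sga_hits), ("with physical I/O", txn_cpu_us_with_pio)):
    ceiling_tps = cores * cpu_us_per_core_per_sec / cost_us
    print(f"{label}: CPU-bound ceiling ~{ceiling_tps:,.0f} transactions/sec")
```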
Thanks for the update, Kevin, I get it now