Several readers sent in email questions after reading the Winter Corporation paper about Exadata that I announced in a recent blog entry. I thought I'd answer one of them in this quick post.
The reader's email read as follows (my only edits were to remove the SAN vendor's name and clean up a couple of typos):

I have read through the Winter Corporation paper regarding Exadata, but is it ok for me to ask a question? We have an existing data warehouse on a 4-node 10g RAC cluster attached to a [ brand name removed ] SAN array by 2 active 4Gb ports on each RAC node. When we test with Orion we see nearly 2.9 gigabytes per second of throughput, but with RDBMS queries we never see more than about 2 gigabytes per second except in the select count(*) case. With select count(*) we see about 2.5GB/s. Why is this?
Think Plumbing First
It is always best to focus first on the plumbing, and only then on the array controller itself. After all, if the maximum theoretical throughput of an array is on the order of 3GB/s but the servers are attached with limited connectivity, that bandwidth is unrealizable. In this case, 2 active 4GFC HBA ports per RAC node are sufficient connectivity for this particular SAN array. Remember, I deleted the SAN brand. The brand cited by the reader is most certainly limited to about 3GB/s (I know the brand and model well), but no matter what kind of SAN it is, the 2 active 4GFC paths on each of the 4 RAC nodes limit I/O to an aggregate of roughly 3.2GB/s of sustained read throughput. This is actually a well-balanced server-to-storage configuration, and I said as much in a private email to the blog reader who sent me this question. But what about the reader's question?
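The back-of-envelope arithmetic behind that 3.2GB/s figure can be sketched as follows. The roughly 400MB/s of usable payload bandwidth per 4GFC port is an assumption on my part (4Gb/s line rate less encoding overhead, the usual rule of thumb), not a number the reader supplied:

```python
# A minimal sketch of the host-side bandwidth ceiling for the reader's
# configuration. The per-port figure is an assumption: ~400 MB/s of usable
# payload per 4GFC port (4 Gb/s line rate less 8b/10b encoding overhead).

GBPS_PER_4GFC_PORT = 0.4   # usable GB/s per 4GFC port (assumption)
PORTS_PER_NODE = 2         # two active 4GFC HBA ports per RAC node
RAC_NODES = 4              # four-node RAC cluster

aggregate_gbps = GBPS_PER_4GFC_PORT * PORTS_PER_NODE * RAC_NODES
print(f"Aggregate host-side bandwidth: {aggregate_gbps:.1f} GB/s")  # 3.2 GB/s
```

Since the array itself tops out around 3GB/s, the host plumbing and the array ceiling are within a few percent of each other, which is what I mean by well balanced.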
Orion is a Very Light Eater
The reason the reader is able to drive storage at approximately 2.9GB/s with Orion is because Orion does nothing with the data being read from disk. As I/Os are completed it simply issues more. We sometimes call this lightweight I/O testing because the code doesn’t touch the data being read from disk. Indeed, even the dd(1) command can drive storage at maximum theoretical I/O rates with a command like dd if=/dev/sdN of=/dev/null bs=1024k. A dd(1) command like this does not touch the data being read from disk.
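The lightweight-versus-heavyweight distinction can be sketched in a few lines. This is an illustration of the concept, not Orion's actual code; the file and sizes are made up for the example:

```python
import os
import tempfile
import zlib

# A sketch of "lightweight" I/O (read and discard, as Orion and the dd
# example do) versus I/O that actually touches every byte read from disk.
# The scratch file and its size are illustrative assumptions.

CHUNK = 1024 * 1024  # 1 MB reads, like dd bs=1024k

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(8 * CHUNK))  # an 8 MB scratch file
    path = f.name

def light_read(path):
    """Issue reads but never touch the data: CPU cost is near zero."""
    total = 0
    with open(path, "rb") as f:
        while buf := f.read(CHUNK):
            total += len(buf)  # only the buffer length is examined
    return total

def heavy_read(path):
    """Touch every byte (checksum): every cache line must be loaded."""
    crc, total = 0, 0
    with open(path, "rb") as f:
        while buf := f.read(CHUNK):
            crc = zlib.crc32(buf, crc)  # forces every byte through the CPU
            total += len(buf)
    return total, crc

n_light = light_read(path)
n_heavy, _ = heavy_read(path)
os.remove(path)
```

Both functions move the same number of bytes, but only the second pays the per-byte processor cost; at a high enough I/O rate, that cost is what caps throughput.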
The reason the reader sees Oracle driving storage at 2.5GB/s with a SELECT COUNT(*) is that a query such as this loads only a few bytes of each disk block into the processor caches. Oracle doesn't have to touch every row piece in a block to know how many rows the block contains; there is summary information in the block header that speeds up row counting. When code references even one byte of an Oracle block after it is read from disk, the processor causes the memory controller to load 64 bytes (on x86_64 CPUs) into the processor cache. Anything in that 64-byte "line" can then be accessed for "free" (no additional loads from memory are needed), but accessing any other 64-byte line in the block causes another memory line to be installed into the processor cache. While the CPU waits for a line to be loaded it is stalled, and those stalls are accounted for as user-mode cycles charged to the Oracle process referencing the memory.

The more processing Oracle does with the blocks being read from disk, the higher processor utilization climbs and, eventually, the lower I/O throughput falls. This is why the reader sees about 2GB/s when Oracle is presumably doing "real queries" such as those which perform filtration, projection, joins, sorting, aggregation and so forth. The reader didn't state processor utilization for the queries limited to 2GB/s, but it stands to reason they were more complex than the SELECT COUNT(*) test.
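The cache-line arithmetic works out as follows. The 8KB block size is an assumption on my part (a common db_block_size; the reader didn't state theirs):

```python
# Cache-line arithmetic for an Oracle block, assuming an 8 KB block
# (a common db_block_size, not stated by the reader) and the 64-byte
# cache lines of x86_64 processors.

BLOCK_SIZE = 8192   # bytes per Oracle block (assumption)
CACHE_LINE = 64     # bytes per x86_64 cache line

lines_per_block = BLOCK_SIZE // CACHE_LINE
print(lines_per_block)  # 128 cache lines per block

# A SELECT COUNT(*) that reads only the row count from the block header
# touches on the order of one line per block; a query that references
# every row piece can force all 128 lines through the memory controller,
# stalling the CPU on each miss.
```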
Double Negatives for Fun and Learning Purposes
You too can see what I'm talking about by running a select that ingests dispersed columns from all rows after performing zero-effect filtration, such as the following example against a 16-column table:
SELECT AVG(LENGTH(col1)), AVG(LENGTH(col8)), AVG(LENGTH(col16)) FROM TABX WHERE col1 NOT LIKE '%NEVER' AND col8 NOT LIKE '%NEVER' AND col16 NOT LIKE '%NEVER';
This test presumes columns 1, 8, and 16 never end with the string 'NEVER', so the predicates eliminate no rows. Observe the processor utilization when running this sort of query and compare it to a simple SELECT COUNT(*) of the same table.
Sure, the difference between the reader's SELECT COUNT(*) and Orion throughput could be related to tuning issues (e.g., the Parallel Query degree of parallelism). However, in my experience, achieving about 83% of maximum theoretical I/O with a SELECT COUNT(*) is pretty good. Further, the reader's complex queries achieved about 66% of maximum theoretical I/O throughput, which is also quite good when using conventional storage.
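For the record, those percentages come straight from dividing the reader's measured rates by the array's roughly 3GB/s theoretical ceiling:

```python
# Efficiency figures cited above, measured against the array's
# ~3 GB/s theoretical ceiling (the figure I used in the text).

max_gbps = 3.0            # theoretical ceiling of this SAN array
count_star_gbps = 2.5     # reader's SELECT COUNT(*) throughput
complex_query_gbps = 2.0  # reader's "real query" throughput

print(f"{count_star_gbps / max_gbps:.1%}")     # 83.3%
print(f"{complex_query_gbps / max_gbps:.1%}")  # 66.7%
```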
What Does This Have To Do With Exadata?
Exadata offloads predicate filtering and column projection (amongst its many other value propositions). Even this silly example has processing that can be offloaded to Exadata, namely the filtration that filters out no rows and the cost of projecting columns 1, 8, and 16. The database host spends no cycles on the filtration or projection; it performs only the work of the AVG() and LENGTH() functions.
I didn't have the heart to point out to the reader that 3GB/s is the minimum throughput available when using Exadata with Real Application Clusters (RAC). That is, with RAC the minimum number of Exadata Storage Servers supported is 3, and there's no doubt that 3 Exadata Storage Servers do indeed offer 3GB/s of query I/O throughput. In fact, as the Winter Corporation paper shows, Exadata is able to sustain maximum theoretical I/O throughput even with complex, concurrent queries because there is 2/3rds of a Xeon 54XX "Harpertown" processor for each disk drive offloading processing from the database grid.
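The per-disk processing figure falls out of the cell hardware. I'm reading "2/3rds of a processor" as two-thirds of a core, based on my understanding of the Exadata V1 Storage Server hardware (two quad-core Harpertown CPUs and twelve drives per cell); treat those hardware counts as assumptions:

```python
# "2/3rds of a processor per disk" arithmetic, assuming Exadata V1
# Storage Server hardware: two quad-core Xeon 54XX (Harpertown) CPUs
# and twelve disk drives per cell (hardware figures assumed).

CORES_PER_CELL = 2 * 4   # two quad-core Harpertown processors per cell
DISKS_PER_CELL = 12      # drives per Storage Server

cores_per_disk = CORES_PER_CELL / DISKS_PER_CELL
print(f"{cores_per_disk:.2f} cores per disk")  # 0.67 cores per disk

# Minimum RAC configuration: three cells at roughly 1 GB/s of scan
# throughput each (per-cell rate is an approximation).
MIN_CELLS = 3
GBPS_PER_CELL = 1.0
print(f"{MIN_CELLS * GBPS_PER_CELL:.1f} GB/s floor")  # 3.0 GB/s floor
```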
So, while Orion is indeed a “light eater”, Exadata is quite ravenous.