I’ve received a good deal of email following my recent blog entry entitled Winter Corporation Assessment of Exadata Performance: Lopsided! Test it All, or Don’t Test at All? I’m not going to continue the drama that ensued from that blog post, but an email I received the other day on the matter warrants a blog entry. The reader stated:
[…text deleted…] that is why I sort of agree with Dan. It makes no sense to load a huge test database without showing how long it took to load it. Now I see Greenplum has very fast data loading Oracle should wake up or […text deleted…]
That is a good question and it warrants this blog entry. I was quite clear in my post about why the Winter Corporation report didn’t cover every imaginable test, but I want to go into the topic of data loading a bit.
The reader asking this question was referring to a blogger who took Richard Winter’s Exadata performance assessment to task citing three areas he deemed suspiciously missing from the assessment. The first of these perceived shortcomings is what the reader was referring to:
High Performance Batch Load – where are the performance numbers of high performance batch load, or of parallel loads executing against the device? How many parallel BIG batch loads can execute at once before the upper limits of the machine and Oracle are reached?
So the blogger and the reader who submitted this question/comment are in agreement.
The Winter Corporation Exadata performance assessment is quite clear in two areas related to the reader’s question. First, the report shows that the aggregate size of the tables was 14 terabytes. With Automatic Storage Management mirroring, this equates to 28 terabytes of physical disk. Second, the report is clear that each Exadata Storage Server offers 1 GB/s of (read) disk bandwidth, which works out to 14 GB/s across the 14 storage servers in a full-height HP Oracle Database Machine. If Exadata happened to be unlike every other storage architecture by offering parity between read and write bandwidth, loading the 28 TB of mirrored user data would have taken only 2000 seconds, which is a load rate of about 50 TB/h. Believe me, it wasn’t loaded at a 50 TB/h rate. See, Exadata-related literature is very forthcoming about the sustained read I/O rate of 1 GB/s per Exadata Storage Server. It turns out that the sustained write bandwidth of Exadata is roughly 500 MB/s per storage server, or 7 GB/s for a full-height HP Oracle Database Machine. Is that broken?
Writes are more expensive for all storage architectures. I personally don’t think a write bandwidth equivalent to 50% of the demonstrated read bandwidth is all that bad. But, now the cat is out of the bag. The current maximum theoretical write throughput of an HP Oracle Database Machine is a paltry 7 GB/s (I’m being facetious because, uh, 7 GB/s of write bandwidth in a single 42U configuration is nothing to sneeze at). At 7 GB/s, the 28 TB of mirrored user data would have been loaded in 4000 seconds (a load rate of about 25 TB/h). But, no, the test data was not loaded at a rate of 25 TB/h either. So what’s my point?
My point is that even if I told you the practical load rates for Exadata Storage Server, I wouldn’t expect you to believe me. After all, it didn’t make the cut for inclusion in the Winter Corporation report so what credence would you give a blog entry that claims something between 0 TB/h and 25 TB/h? Well, I hope you give at least a wee bit of credence because the real number did in fact fall within those bounds. It had to. As an aside, I have occasionally covered large streaming I/O topics specific to Exadata but remember that a CTAS operation is much lighter than ingesting ASCII flat file data and loading into a database. I digress.
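For readers who want to check the bounds above, here is a minimal sketch of the arithmetic. It assumes decimal units (1 TB = 1000 GB) and 14 storage servers per full rack, a count inferred from the 500 MB/s per-server figure scaling to 7 GB/s. The function and constant names are mine, purely for illustration.

```python
# Back-of-envelope load-time bounds for the 28 TB of mirrored data.
# Assumptions (not from the Winter report): decimal units, and 14 storage
# servers per full rack, inferred from 500 MB/s per server -> 7 GB/s.

SERVERS = 14          # assumed storage servers in a full rack
MIRRORED_TB = 28.0    # 14 TB of tables, mirrored


def load_bounds(per_server_gb_s, servers=SERVERS, data_tb=MIRRORED_TB):
    """Return (seconds, TB/h) for writing data_tb at the aggregate bandwidth."""
    aggregate_gb_s = per_server_gb_s * servers
    seconds = data_tb * 1000 / aggregate_gb_s
    tb_per_hour = aggregate_gb_s * 3600 / 1000
    return seconds, tb_per_hour


for label, per_server in [("write at read parity (hypothetical)", 1.0),
                          ("sustained write", 0.5)]:
    seconds, rate = load_bounds(per_server)
    print(f"{label}: {seconds:.0f} s, {rate:.1f} TB/h")
```

These are ceilings set by disk bandwidth alone. They say nothing about the cost of parsing flat files, which is why the practical rate has to land somewhere below the second figure.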
This is an Absurd Blog Entry
Is it? Here is how I feel about this. I know the practical data loading rates of Exadata but I can’t go blurting them out without substantiation. Nonetheless, put all that aside in your minds for a moment.
Until quite recently, bulk data loading claims by DW/BI solution providers such as Netezza and Vertica were not exactly phenomenal. For instance, Netezza advertises 500 GB/hour load rates in this collateral, and Vertica found load rates of 300 MB/minute and 5 GB/minute interesting enough to mention in theirs. As you can tell, data loading rates are all over the map. So, one of two things must be true: either a) Oracle is about to get more vocal about Exadata bulk loading capabilities, or b) Exadata is a solution that offers the best realizable physical disk read rates but an embarrassing bulk data loading rate.
I know the competition is betting on the latter because they have to.
What About the Reader’s Question?
Sorry, I nearly forgot. The reader was very concerned about bulk data loading and mentioned Greenplum. According to this NETWORKWORLD article, Greenplum customer Fox Interactive Media is quoted as having said:
We’re loading at rates of four terabytes an hour, consistently.
Quoting a customer is a respectable proof point in my opinion. The only problem I see is that there is no detail about the claim. For instance, the quote says “rates of four terabytes an hour […]” There is a big difference between stating a loading rate and a loading result. For example, this Greenplum customer cites a rate of 4 TB/h (about 1.1 GB/s) without mention of the configuration. Let’s suppose for a moment that the configuration is a DWAPP-DW40, which is the largest single-rack configuration available (to my knowledge) from Greenplum. I admit I don’t know enough about Greenplum architecture to know, but depending on how RAM is used by the four Sun Fire X4500 servers in the configuration (aggregate 64 GB), it is conceivable that the customer could “load” data at a 4 TB/h rate for nearly an entire minute without touching magnetic media.
I presume Greenplum doesn’t cache bulk-inserted data, but the point I’m trying to make is the difference between a loading rate and a loading result. It is a significant difference. Am I off my rocker? Well, let’s think about this. The Greenplum customer stated that they are “loading at rates of four terabytes an hour, consistently.” Who amongst you thinks the customer stages up exactly 1 TB of data in production and then plops it into the database with stopwatch in hand? Is there a chance the customer loads, say, 0.42 or 1.42 terabytes as a timed procedure and normalizes the result to a per-hour rate? Of course that is possible and totally reasonable. So why would it be totally absurd for me to suggest that perhaps the case is an occasional load of, say, 100 gigabytes taking about 90 seconds (also an approximate 4 TB/h rate)? What if it is 20 gigabytes loading in about 18 seconds, repeated every few minutes? That too is a 4 TB/h rate. And, honestly, both would be truthful and valid depending on the site requirements. The point is we don’t know. But I’m not going to totally discount the information because it is missing something I deem essential.
I’m willing to take the information reported by the Greenplum customer and say, “Yes, 4 TB/h is a good load rate!” Thus, I have answered the blog reader’s original question about bulk loading vis a vis the recent Greenplum news on the topic.
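To make the rate-versus-result distinction concrete, here is a small sketch of how very different timed loads normalize to the same per-hour figure. The volumes echo the ones I floated above. The timings are invented for illustration and are not anything the Greenplum customer reported.

```python
def tb_per_hour(loaded_tb, elapsed_seconds):
    """Normalize a timed load (a result) to a per-hour figure (a rate)."""
    return loaded_tb * 3600 / elapsed_seconds


# Hypothetical timed loads: (terabytes loaded, elapsed seconds).
# The timings are made up so that each one works out to the same rate.
timed_loads = [
    (0.42, 378.0),    # a bit over six minutes of loading
    (1.42, 1278.0),   # about twenty-one minutes
    (4.00, 3600.0),   # a full hour with stopwatch in hand
]

for loaded_tb, seconds in timed_loads:
    rate = tb_per_hour(loaded_tb, seconds)
    print(f"{loaded_tb:.2f} TB in {seconds:.0f} s -> {rate:.2f} TB/h")
```

All three results report as roughly 4 TB/h, yet only the last one says anything about what happens over a sustained hour. That is the detail a bare rate leaves out.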
The NETWORKWORLD article quotes Ben Werther, director of product marketing at Greenplum, as having said:
This is definitely the fastest in the industry, [ … ] Netezza for example quotes 500GB an hour, and we have not seen anyone doing more than 1TB an hour.
Well, I think Ben has it wrong. Vertica has a proof point of a loading result of 5.4 terabytes in 57 minutes 21.5 seconds, which is a rate of 5.65 TB/h. This result was independently validated (no doubt a paid engagement, which is no less valid) and is to be trusted. So, Ben’s statement represents a 5X under-exaggeration, which is better than the 100X and 100X+ exaggerations I occasionally rant about.
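Since the figures quoted in this post come in a mix of units (MB/minute, GB/hour, terabytes over an odd elapsed time), a quick sketch that puts them all in TB/h may help. It assumes decimal units. The labels and the helper name are mine, and the inputs are only the numbers already cited above.

```python
SECONDS_PER_HOUR = 3600
GB_PER_UNIT = {"MB": 1e-3, "GB": 1.0, "TB": 1e3}  # decimal units assumed


def to_tb_per_hour(amount, unit, per_seconds):
    """Convert 'amount unit per per_seconds' into terabytes per hour."""
    gigabytes = amount * GB_PER_UNIT[unit]
    return gigabytes / 1000 * SECONDS_PER_HOUR / per_seconds


# Figures as quoted in this post: (amount, unit, interval in seconds).
claims = {
    "Netezza collateral": (500, "GB", 3600),
    "Vertica collateral (low)": (300, "MB", 60),
    "Vertica collateral (high)": (5, "GB", 60),
    "Greenplum customer quote": (4, "TB", 3600),
    "Vertica validated result": (5.4, "TB", 57 * 60 + 21.5),
}

for name, (amount, unit, secs) in claims.items():
    print(f"{name:28s} {to_tb_per_hour(amount, unit, secs):7.3f} TB/h")
```

Only the last entry is a timed result with a stated volume and elapsed time. The others are rates quoted without that context.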
A Word About the Vertica Bulk-Loading Result
It used TPC-H DBGEN to load data into a 3rd normal form schema. While I’m not going to totally discount a test because it doesn’t embody a tutorial on schema design, many readers may not know this fact about that particular proof point. The proof point is about data loading, not about schema design. The blocks of data were splat upon the round, brown spinning thingies at the reported, validated rate. No questioning that. Schemas have nothing to do with that though and thus it is just fine to use a 3rd normal form schema for such a proof point.
Greenplum and Vertica have good proof points out there for bulk data loading, and a single-rack HP Oracle Database Machine cannot bulk-load data faster than roughly 25 terabytes per hour. Oracle’s competitors rest on hopes that the actual bulk-loading capability of the HP Oracle Database Machine is a small fraction of that. For the time being, it seems, that will remain the status quo.
I’ve lived prior lives wishing the competition was pigeonholed one way or the other.