It turns out that my suspicion about late-breaking competitive loading rates versus loading results was not that far off base. As I discussed in my recent post about Exadata “lagging” behind competitors’ bulk-loading proof points, there is a significant difference between citing a loading rate and citing a loading result. While it is true that customers generally don’t stage up a neat, concise set of flat files totaling, say, 1TB and load it with stopwatch in hand, it is important to understand the processing dynamics of bulk data loading. Some products can suffer peaks and valleys in throughput as the data is being loaded.
I don’t suspect any of that where these competitors’ results are concerned. I just want to reiterate the distinction between cited loading rates and loading results. When referring to the Greenplum news of a customer’s 4TB/h loading rate, I wrote:
The Greenplum customer stated that they are “loading at rates of four terabytes an hour, consistently.” […] Is there a chance the customer loads, say, 0.42 or 1.42 terabytes as a timed procedure and normalizes the result to a per-hour rate?
What if it is 20 gigabytes loaded in an 18-second burst, repeated every few minutes? That too normalizes to a 4TB/h rate.
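To make the distinction concrete, here is a quick back-of-the-envelope sketch. The burst sizes and durations below are hypothetical illustrations, not figures from any vendor announcement:

```python
# Back-of-the-envelope: normalize a timed load to a "per hour" rate.
# All figures below are hypothetical illustrations, not vendor numbers.

def hourly_rate_tb(data_gb, elapsed_seconds):
    """Normalize a timed load (GB over seconds) to a TB/h figure (decimal units)."""
    gb_per_hour = data_gb * 3600 / elapsed_seconds
    return gb_per_hour / 1000

# A sustained result: 2,000 GB loaded in 30 minutes.
print(hourly_rate_tb(2000, 30 * 60))   # 4.0 TB/h

# A short burst: 20 GB loaded in an 18-second spike, idle in between.
print(hourly_rate_tb(20, 18))          # 4.0 TB/h

# Both normalize to a "4TB/h rate", but only the first is a loading result.
```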
While reading Curt Monash’s blog I found his reference to Eric Lai’s deeper analysis. Eric quotes a Greenplum representative as having said:
The company’s customer, Fox Interactive Media Inc., operator of MySpace.com, can load 2TB of Web usage data in half an hour
That is a good loading rate. But that isn’t what I’m blogging about. Eric continued to quote the Greenplum representative as saying:
To achieve 4TB/hour load speeds requires 40 Greenplum servers
40 Greenplum servers…now we are getting somewhere. To the best of my knowledge, this Greenplum customer would have the Sun Fire X4500-based Greenplum solution. The X4500 (a.k.a. Thumper) sports two dual-core AMD Opteron processors, so the 40-server configuration has 160 processor cores.
While some people choose to quote loading rates in TB/h terms, I prefer expressing them in megabytes per second per processor core (MBPS/core). Expressed in MBPS/core, the Greenplum customer is loading data at a rate of roughly 7.28 MBPS/core.
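The arithmetic behind that figure is simple. A minimal sketch, assuming binary units (1TB = 1,048,576MB) and the 160-core estimate above:

```python
# Convert a headline TB/h loading rate into MB/s per processor core.
# Assumes binary units (1 TB = 1024 * 1024 MB) and 40 servers with 4 cores each.

tb_per_hour = 4
cores = 40 * 4                           # 160 cores in the estimated configuration

mb_per_second = tb_per_hour * 1024 * 1024 / 3600
print(round(mb_per_second / cores, 2))   # 7.28 MB/s per core
```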
Summary
Bold statements about loading rates without any configuration information are not interesting.
PS. I almost forgot. I still think Option #2 in this list is absurd.
How is data defined? Is it the size of the input text? The size of the loaded tables (data, not indexes) after the load?
If your question is about the Greenplum 7.3 MBPS/core, I don’t know. In my opinion, the only metric that matters is the size of the data to be loaded, not the size of the resultant tables.
Now, having said that, how do we handle compressed flat files? Oracle can ingest compressed flat files using the External Table PREPROCESSOR feature. Some flat files compress down as much as 8:1. So if I’m pulling compressed flat file data at the line rate of 1Gb Ethernet (roughly 115 MB/s of payload), I have an effective ingest rate approaching 1GB/s. By the time the data expands (on the way into the database) I’m at 115MB * 8, or roughly 920MB, per second.
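A minimal sketch of that effective-ingest arithmetic, assuming the rough 115 MB/s wire rate and 8:1 compression ratio cited above:

```python
# Effective ingest rate when pulling compressed flat files over the network
# and expanding them on the way into the database.
# Rough assumptions from the text: ~115 MB/s usable on 1Gb Ethernet, 8:1 compression.

wire_rate_mb_s = 115           # approximate payload rate of 1Gb Ethernet
compression_ratio = 8          # flat files that compress about 8:1

effective_ingest_mb_s = wire_rate_mb_s * compression_ratio
print(effective_ingest_mb_s)   # 920 MB/s of uncompressed data, i.e. roughly 1GB/s
```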
I look at it this way: at some point the flat file data was N bytes in size, before it was compressed. That is the number I’m interested in. If the provider systems do the heavy lifting of compress-before-send, that only benefits the network. After all, it costs CPU to compress and CPU to uncompress the flat files, so that is a lot to pay. I envision a better world where the provider system(s) are on the InfiniBand network, or 10 GbE, so that compressing flat files won’t be as necessary.
In summary, I don’t know what Greenplum’s customer is actually measuring, but I know what the measurement should be.