Bulk Data Loading Rates. Is 7.3 MB/s per CPU Core Fast, or Fast Enough? Part I.

It turns out that my suspicion about late-breaking competitive loading rates versus loading results was not that far off base. As I discussed in my recent post about Exadata “lagging” behind competitors’ bulk-loading proof points, there is a significant difference between citing a loading rate and citing a loading result. While it is true that customers generally don’t stage a neat, concise set of flat files totaling, say, 1TB and load it with stopwatch in hand, it is important to understand the processing dynamics of bulk data loading. Some products suffer peaks and valleys in throughput as the data is being loaded.

I don’t suspect any of that where these competitors’ results are concerned. I just want to reiterate the distinction between cited loading rates and loading results. Referring to the Greenplum news of a customer’s 4TB/h loading rate, I wrote:

The Greenplum customer stated that they are “loading at rates of four terabytes an hour, consistently.” […] Is there a chance the customer loads, say, 0.42 or 1.42 terabytes as a timed procedure and normalizes the result to a per-hour rate?

What if it is roughly 68 gigabytes loaded in a single minute, repeated every few minutes? During each burst, that too is a 4TB/h rate.
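To make the burst-versus-sustained point concrete, here is a back-of-the-envelope sketch (binary units, i.e. 1TB = 1024*1024 MB) showing how a one-minute burst normalizes to a per-hour rate:

```python
# Sketch: a short burst, normalized to a per-hour figure, can be
# quoted as the same "4TB/h rate" as a sustained hourly result.
MB_PER_TB = 1024 * 1024   # binary units

burst_mb = 68 * 1024      # ~68 GB loaded in the burst, expressed in MB
burst_seconds = 60        # the burst lasts one minute

# Normalize the one-minute burst to a per-hour rate.
burst_rate_tb_per_hour = (burst_mb / burst_seconds) * 3600 / MB_PER_TB
print(f"burst rate: {burst_rate_tb_per_hour:.2f} TB/h")  # ~3.98 TB/h
```

If the loader then idles for several minutes between bursts, the sustained rate is a fraction of that headline number.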

While reading Curt Monash’s blog I found his reference to Eric Lai’s deeper analysis. Eric quotes a Greenplum representative as having said:

The company’s customer, Fox Interactive Media Inc., operator of MySpace.com, can load 2TB of Web usage data in half an hour

That is a good loading rate. But that isn’t what I’m blogging about. Eric continued to quote the Greenplum representative as saying:

To achieve 4TB/hour load speeds requires 40 Greenplum servers

40 Greenplum servers…now we are getting somewhere. To the best of my knowledge, this Greenplum customer would have the Sun Fire X4500-based Greenplum solution. The X4500 (a.k.a. Thumper) sports two dual-core AMD processors, so the 40-server configuration has 160 processor cores.

While some people choose to quote loading rates in TB/h form, I prefer expressing loading rates in megabytes per second per processor core (MBPS/core). Expressed in MBPS/core, the Greenplum customer is loading data at the rate of 7.28 MBPS/core.
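The conversion from the cited TB/h figure to MBPS/core is straightforward arithmetic (again in binary units):

```python
# Convert the cited 4TB/h across 160 cores into MB/s per core.
MB_PER_TB = 1024 * 1024

load_rate_mb_per_s = 4 * MB_PER_TB / 3600   # 4 TB/h expressed in MB/s
cores = 40 * 4                              # 40 Thumpers x 4 cores each

per_core = load_rate_mb_per_s / cores
print(f"{per_core:.2f} MB/s per core")      # 7.28
```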

Summary
Bold statements about loading rates without any configuration information are not interesting.

PS. I almost forgot. I still think Option #2 in this list is absurd.

2 Responses to “Bulk Data Loading Rates. Is 7.3 MB/s per CPU Core Fast, or Fast Enough? Part I.”


  1. Mark Callaghan March 21, 2009 at 1:27 pm

    How is data defined? Is it the size of the input text? The size of the loaded tables (data, not indexes) after the load?

    • kevinclosson March 21, 2009 at 6:15 pm

      If your question is about the Greenplum 7.3 MBPS/core, I don’t know. In my opinion, the only metric that matters is the size of the data to be loaded, not the size of the resultant tables.

      Now, having said that, how do we handle compressed flat files? Oracle can ingest compressed flat files using the External Table PREPROCESSOR feature. Some flat files compress down as much as 8:1. So if I’m pulling compressed flat-file data at the line rate of 1Gb Ethernet, approximately 115 MB/s, I have an effective ingest rate of roughly 1GB/s. By the time the data expands (on the way into the database) I’m ingesting at 115MB * 8 per second.
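A quick sketch of that effective-ingest arithmetic, using the approximate GbE line rate and the 8:1 compression ratio stated above:

```python
# Effective ingest rate when pulling compressed flat files over GbE
# that expand on the way into the database.
wire_rate_mb_s = 115      # approximate 1Gb Ethernet line rate in MB/s
compression_ratio = 8     # 8:1, as stated above

effective_mb_s = wire_rate_mb_s * compression_ratio
print(effective_mb_s)     # 920 MB/s, i.e. roughly 1GB/s of flat-file data
```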

      I look at it this way. At some point the flat-file data was N bytes in size, before it was compressed. That is the number I’m interested in. If the provider systems do the heavy lifting of compress-before-send, that only benefits the network. After all, it costs CPU to compress and CPU to decompress the flat files, so that is a lot to pay. I envision a better world where the provider system(s) are on the InfiniBand network, or 10GbE, so that compressing flat files won’t be as necessary.

      In summary, I don’t know what Greenplum’s customer is actually measuring, but I know what the measurement should be.



Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.
