I Know Nothing About Data Warehouse Appliances, and Now, So Won’t You – Part I | Kevin Closson's Blog: Platforms, Databases and Storage

I Know Nothing About Data Warehouse Appliances, and Now, So Won’t You – Part I

Published July 3, 2008 oracle 12 Comments

I’ve been watching all these come-lately DW/BI technologies for a while now-especially the ever-so-highly-revered “appliances.” I’m also interested in columnar orientation as my past posts on columnar technology (e.g., columnar technology I, columnar technology II) will attest.

Rows and Columns, or Columns and Rows?

I don’t know, because in that famed Unfrozen Caveman Lawyer style, these things confuse me. However, Stuart Frost, CEO of DATAllegro, puts it this way in his fledgling blog:

At the end of the day, column orientation is just one approach to limiting the amount of data read for a given query. In effect, it’s an extreme form of vertical partitioning of the data. In modern row-oriented systems such as DATAllegro, we use sophisticated horizontal partitioning to limit the number of rows read for each query.

Clue’isms are Truisms

Huh? “Sophisticated horizontal partitioning?” Now that is a novel approach. And if all I want to scan is a column or two with Oracle, I’ll create an index. Is it really that much more complicated than that? An index is columnar representation after all. Heck, I could even partition that “columnar representation” with a sophisticated horizontal partitioning technology (that has been in Oracle since the early 1990s) to further reduce the data ingestion cost.

Indexes == Anathema

Oops, I should wash my mouth out with soap. After all, the “appliances” shall save you from the torment of creating a few indexes, right? Well, maybe not. The term of the day is “Index-Light Appliance.”

So I have to ask, what if I were to implement an Oracle-based data warehouse that used, say, 5 indexes. Would that be an Index-Light approach?

Oracle is taking steps to make the configuration of hardware for a DW/BI deployment a bit simpler. If you haven’t yet seen it, the Optimized Warehouse Initiative is worth investigating.

12 Responses to “I Know Nothing About Data Warehouse Appliances, and Now, So Won’t You – Part I”

Feed for this Entry Trackback Address

1 Marc Handelman July 3, 2008 at 9:57 pm

Yup, “sophisticated horizontal partitioning”, is what I do each night when “I lays me head down to sleep”………………….

Reply
2 Joe July 4, 2008 at 7:25 pm

If you can even consider Orcale then, frankly, your data volumes are trivial. 100 million rows? 1Tb? Trivial. I hate to be flippant but it’s just that simple.

The funny thing is that even at that level Oracle will make you jump through all kinds of hoops just to keep things moving.

When you’re loading up to a Tb per night and serving 100,000 ad-hoc, highly complex queries a day Oracle will not play.

Trust me, I’ve been there and done that. It just won’t.

If you in the 1-5 Tb range think about Sybase, Vertica or Greenplum. If you’re over 5 Tb you’ll need Teradata, Netezza, DatAllegro or Dataupia. Netezza is the fastest database money can buy.

It’s so fast that I thought it was a setup when they did the POC but we got better speeds in production.

Reply
3 kevinclosson July 4, 2008 at 9:46 pm

Joe,

Who said anything about 100 million rows or a Terabyte?

Please list some of the “all kinds of hoops.”

Can you give some examples of these adhoc, highly complex queries
please?

When you say trust you because you’ve been there and done that and “it won’t.” Is that the same as saying you’ve been there and done that and it didn’t? If not then we don’t care. If it didn’t why not list the reasons it didn’t.

Reply
4 Maxim Grinev July 5, 2008 at 4:04 pm

Kevin,

it might be interesting for you to read this paper:
“Column-stores vs. row-stores: how different are they really?”
http://portal.acm.org/citation.cfm?doid=1376616.1376712

The authors compare row-oriented database tuned in a ways described above using index-only query execution plans, vertical partitioning, materialized views with only columns needed.

The authors conclude that “while it is not impossible for a row-store to achieve some of the performance advantages of a column-store, changes must be made to both the storage layer and the query executor to fully obtain the benefits of a column-oriented approach.”

And then:
“A successful column-oriented simulation will require some important
system improvements, such as virtual record-ids, reduced
tuple overhead, fast merge joins of sorted data, run-length encoding
across multiple tuples, and some column-oriented query execution
techniques like operating directly on compressed data, block processing, invisible joins, and late materialization.”

Reply
5 Raya July 5, 2008 at 5:34 pm

So I’ll take a middle ground. What Joe said about Oracle is absolutely right. For large data, Oracle requires a lot of tuning and customization. For ad hoc complex queries (lots of complex joins, sorts, aggregations) on huge data volumes, the performance just stinks.

That said – Netezza is absolutely *NOT* the fastest database. It’s great for streaming queries (eg. filtering) and decent for complex queries to a certain moderate data size. But for large volumes, it also just doesn’t scale. The real benefit of Netezza is the simplicity – great appliance, pop it in and it works without tuning or jumping hoops like Oracle. Interestingly, as Oracle (and Teradata) unveil appliances, I wonder if Netezza’s days are limited:)

Most database tech gurus know that for true scalability, the newcomers are the ones truly have the technology. It’s still a crowded field but the 4 horses I’d bet on are (in alphabetical order): Aster Data, Datallegro, Greenplum, and Vertica. These guys all have very interesting scaling possibilities and the market is big enough (and growing fast enough) to support them all.

Reply
6 kevinclosson July 5, 2008 at 6:40 pm

Maxim, Raya, thanks.

Reply
7 Joe July 7, 2008 at 8:45 pm

Kevin,

I probably should have read a few more of your posts before I commented. Nevertheless, my points still stand.

If you talking about DWA’s you talking 100m rows or 1Tb minimum. Otherwise you’re wasting our time.

Ad-hoc, highly complex queries like the kind that a team of 50 SAS jockeys can only write. Ridiculous joins galore. Multiple deeply nested subqueries hitting the same table in repetition.

Volumes like 10,000 queries an hour competing with the previous.

Oracle hoops? Where do I start… Saturated RAC clusters. Insane indexing schemes that consume 3x the user data. Multiple hundred drives each at 20% utilisation. Come on Kev, I’ve read a few of your posts. You must know that Oracle doesn’t exactly make this stuff easy.

We rolled Netezza off the truck and our load window *instantly* shrunk by a factor of 100. Our worst query instantly performed 1000x better. That’s real speed on a naive implementation.

I’m not sure what volumes Raya is talking about. I have not seen or heard about a point of diminshing returns and I ask around when I speak to other customers. The only drawback has been a rip and replace upgrade path which I believe they are adressing.

Look, I’m not saying Oracle sucks for everything. I am saying that it is not suitable for serious DWs.

Reply
8 kevinclosson July 7, 2008 at 9:09 pm

Joe,

Thanks for the elaboration as it helps develop this thread, most particularly the bit about hundreds of underutilized drives–a problem most generally attributable to storage array architecture.

https://kevinclosson.wordpress.com/2007/08/13/hard-drives-are-archane-technology-why-cant-i-realize-their-full-bandwidth-potential/

Reply
9 Joe July 8, 2008 at 9:48 pm

My very last comment before I dispense with the inanity. (BTW choose someone besides DATAllegro if you want to poke at appliances. They are firmly in the Ab Initio school of PR).

The drives weren’t “underutilised” due to “storage array” issues Kev. Anyone using a SAN for a serious warehouse is already so far in the woods that it would be merciful to just shoot them.

I’m specifically talking about a very large number of direct attached, high speed drives used to overcome I/O latency. Exactly like every TPC benchmark in this area.

A parting thought, we run a huge data warehouse with 1 DBA. Part time. Try that on Oracle lately?

What is your real beef with DWAs? Is it impossible for you to believe that Oracle is not the best database for every single use case?

Every technology product is a complex chain of compromises. Oracle cannot be all things to all people with one engine. It’s about time they owned up to it.

Reply
10 kevinclosson July 8, 2008 at 11:34 pm

Joe,

Sorry it seems like insanity. I’m taking a few moments to point out holes in some of the over-hyped stories out there. Nothing more sinister than that. You obviously have experience on some DW appliances and I appreciate your comments on this blog. I do wish you could provide more details from your experiences along the lines of specific queries, schema and disk throughput in GB/s. It always sounds like Carl Sagan with words like “billions and billions” when people talk about these DW Appliance deployments.

I don’t have a beef with DW Appliances, per se and I certainly don’t think Oracle is the best database for every single use case. I have learned over the years that in those cases where Oracle gets trumped, as folks believe they have in the DW/BI space, the sleeping giant generally wakes up and makes the point moot sooner rather than later.

I can perceive from your posts, Joe, that my array bottleneck reference was not likely to apply to your case. I put that out there because I do believe that most people that jump from Oracle to a DW Appliance have left behind a very unbalanced SAN configuration that was actually the cause of most, if not all, of their problems to start with. I’ve seen it time and time again. Bath water first, baby follows.

Reply

1 Data Warehouse Appliances and Column-oriented DBMS « Oracle and other Trackback on July 9, 2008 at 1:49 pm
2 Bookmarks about Storage Trackback on January 3, 2009 at 8:00 pm

	kevinclosson on Announcing SLOB 2.5.4
	Hell Dip on Announcing SLOB 2.5.4
	kevinclosson on Introducing SLOB – The S…
	Amey Bobade on Introducing SLOB – The S…
	Amey Bobade on Introducing SLOB – The S…

Kevin Closson's Blog: Platforms, Databases and Storage