Oracle Exadata Storage Server. Frequently Asked Questions. Part II

This is installment number two in my series on Oracle Exadata Storage Server and HP Oracle Database Machine frequently asked questions. I recommend you also visit Exadata Storage Server Frequently Asked Questions Part I. I’m mostly cutting and pasting questions from the comment threads of my blog posts about Exadata and mixing in some assertions I’ve seen on the web and re-wording them as questions.

Later today, The Pythian Group will conduct a podcast question-and-answer interview with me; it will be available on their website shortly thereafter. I'll post a link when it is available.

Questions and Answers

Q. [I’m] willing to bet this is a full-blown Oracle instance running on each Exabyte [sic] Storage Server.

A. No, bad bet. Exadata Storage Server Software is not an Oracle Database instance. I happened to have an xterm with a shell process sitting in a directory with the self-extracting binary distribution file in it. We can tell by the size of the file that there is no room for a full-blown Oracle Database distribution:

$ ls -l cell*

-rwxr-xr-x 1 root root 206411729 Sep 12 22:04 cell-080905-1-rpm.bin

Q. This must certainly be a difficult product to install, right?

A. HP installs the software on their manufacturing floor. Nonetheless, I'll point out that installing Oracle Exadata Storage Server Software is a single execution of the binary distribution file, without options or arguments. Further, initializing a cell is a single command with two options, of which only one requires an argument, as in the following example, where I specify a bonded InfiniBand interface for interconnect 1:

# cellcli

CellCLI: Release 11.1.3.0.0 – Production on Fri Sep 26 10:56:17 PDT 2008

Copyright (c) 2007, 2008, Oracle. All rights reserved.

Cell Efficiency Ratio: 10,956.8

CellCLI> create cell cell01 interconnect1=bond0

After this command completes, I've got a valid cell. There are no preparatory commands (disk partitioning, volume management, and so on).
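
One way to sanity-check the result is CellCLI's LIST command. The output below is a sketch from memory rather than a verbatim capture, so treat the exact attribute list as approximate:

CellCLI> list cell detail

         name:              cell01
         interconnect1:     bond0
         status:            online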

Q. I’m trying to grasp whether this is really just being pitched at the BI and data warehouse space, or whether it has real value in the OLTP space as well.

A. Oracle Exadata Storage Server is the best block server for Oracle Database, bar none. That being said, in the current release, Exadata Storage Server is in fact optimized for DW/BI workloads, not OLTP.

Q. I know we shouldn’t set too much store in these things, but are there plans to submit TPC benchmarks?

A. You are right that too much stock should not be placed in TPC benchmarks, but they are a necessary evil. I don't work in that space, but could you imagine Oracle not doing some audited benchmarks? It seems unlikely to me.

On the topic of TPC benchmarks, I was taking a gander at the latest move in the TPC-C “arms race.” This IBM RS/6000 p595 result of 6,085,166 TpmC proves that TPC-C is not, and has never been, an I/O efficiency benchmark. If you throw more gear at it, you get a bigger number! Great!

How about a stroll down memory lane?

When I was in Database Engineering at Sequent Computer Systems back in 1998, Sequent published a world-record Oracle TPC-C result on our NUMA system. We achieved 93,901 TpmC using 64GB of main memory. The 6,085,166 IBM number I just cited used 4TB of main memory. So how fulfilling do you think it must be to do that much work on a TPC-C just to prove that in 10 years nothing has changed! The Sequent result comes in at 1 TpmC per 714KB of main memory and the IBM result at 1 TpmC per 705KB of main memory. Now that’s what I want to do for a living! Build a system with 10,992 disk drives and tons of other kit just to beat a 10-year-old result by 1.3%. Yes, we are now totally convinced that if you throw more memory at the workload you get a bigger number! In the words of Gomer Pyle, “Soo-prise, Soo-prise, Soo-prise.” OK, enough of that; I don’t like arms-race benchmarks.
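
For anyone who wants to check my arithmetic, here is a trivial Python sketch of the memory-per-TpmC ratios (the inputs are the published numbers cited above; nothing here is hypothetical beyond the rounding):

# KB of main memory consumed per TpmC for the two results discussed above.
KB_PER_GB = 1024 * 1024

sequent_kb_per_tpmc = (64 * KB_PER_GB) / 93901         # 1998 Sequent result
ibm_kb_per_tpmc = (4 * 1024 * KB_PER_GB) / 6085166     # 2008 IBM result

print(int(sequent_kb_per_tpmc), int(ibm_kb_per_tpmc))  # 714 705
print(f"{(sequent_kb_per_tpmc / ibm_kb_per_tpmc - 1) * 100:.1f}%")  # 1.3%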

Q. From the Oracle Exadata white paper: “No cell-to-cell communication is ever done or required in an Exadata configuration,” and a few paragraphs later: “Data is mirrored across cells to ensure that the failure of a cell will not cause loss of data, or inhibit data accessibility.” Can both these statements be true, and would we need to purchase a minimum of two cells for a small-ish ASM environment?

A. Cells are entirely autonomous, and the two statements are true indeed. Consider two ASM disks out in a Fibre Channel SAN. Of course, we know those two disks are not “aware of each other” just because ASM is using blocks from each to perform mirroring. The same is true for Oracle Exadata Storage Server cells and the drives housed inside them. As for the second part of the question: yes, you must have a minimum of two cells. Although cells are shared-nothing (unaware of each other), ASM is in fact cell-aware: it is intelligent enough not to mirror between two drives in the same cell.
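
To make that concrete, here is a hedged sketch of what a normal-redundancy diskgroup spanning two cells might look like. The ‘o/<cell address>/<grid disk>’ path convention is Exadata’s, but the addresses and grid disk names below are hypothetical, and in practice ASM places each cell in its own failure group without being told:

CREATE DISKGROUP data NORMAL REDUNDANCY
  -- One failure group per cell, so ASM never puts both mirror copies
  -- of an extent on drives inside the same cell (hypothetical names).
  FAILGROUP cell01 DISK 'o/192.168.10.1/data_CD_00_cell01',
                        'o/192.168.10.1/data_CD_01_cell01'
  FAILGROUP cell02 DISK 'o/192.168.10.2/data_CD_00_cell02',
                        'o/192.168.10.2/data_CD_01_cell02';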


Q. Can this secret sauce help with write speeds?

A. That depends. If you have a workload suffering from the loss of processor cycles associated with standard Unix/Linux I/O libraries then, sure. If you have an application that uses storage provisioned from an overburdened back-end Fibre Channel disk loop (due to application collusion) then, sure. Strictly speaking, the “secret sauce” is the Oracle Exadata Storage Server Software, and it does not have any features for write acceleration. Any benefit would have to come from the fact that the I/O pipes are ridiculously fast, the I/O protocol is ridiculously lightweight, and the system as a whole is naturally balanced. I’ll blog about the I/O Resource Management (IORM) feature of Exadata soon, as I feel it has positive attributes that will help OLTP applications. Although it is not an acceleration feature, it eliminates situations where applications steal storage bandwidth from each other.
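
As a teaser for that post: IORM plans are set per cell through CellCLI. The sketch below is illustrative only; the database names are hypothetical and the exact directive syntax should be checked against the documentation for your release:

CellCLI> alter iormplan dbplan=((name=oltp, level=1, allocation=70), -
         (name=dw, level=2, allocation=100), -
         (name=other, level=3, allocation=100))
CellCLI> alter iormplan active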

Q. I like your initial overview of the product, but I believe that you need to compare both Netezza and Exadata side by side in real-world scenarios to gauge their performance.

A. I partially agree. I cannot go and buy a Netezza and legally produce competitive benchmark results based on the gear. Just read any EULA for any storage management software and you’ll see the bold print. Now that doesn’t mean Oracle’s competitors don’t do that. I think the comparison will come in the form of reduced Netezza sales. Heaven knows the 16% drop in Netezza stock was not as brutal as I expected.

Q. Re. [your] comparison to Netezza [in your first Exadata-related post]: it’s a bit of apples to oranges, really. You assume 80MB/s per disk for Exadata and, for some reason, only 70MB/s per disk for Netezza. Also, you have 168 disks spinning in parallel on Exadata and 112 on Netezza. Had your assumptions been the same, sequential I/O throughput would be similar, at least theoretically.

A. Reader, I invite you to explain to us how you think native SATA 7,200 RPM disk drives are going to match 15K RPM SAS drives. When I put 70 MB/s into the equation I was giving quite a benefit of the doubt (as if I’ve never measured SATA performance). Please, if you have a Netezza, let us know how much streaming I/O you get from a 7,200 RPM SATA drive once you read beyond the first few outside sectors. I have also been using the conservative figure of 80 MB/s for our SAS drives. I’m high-balling SATA and low-balling SAS; that sounds fair to me. As for the comparison between the numbers of drives, well, Netezza packaging limits the drive (SPU) count to 112 per cabinet. It would suit me fine if it takes a full rack plus another half rack to match a single HP Oracle Database Machine. That empty half of the rack would be annoying from a space-constraint point of view, though. Nonetheless, if you did go with a rack and a half (168 SPUs), would that somehow cancel out the base difference in drive performance between SATA and SAS?
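
Since the disagreement is ultimately about aggregate arithmetic, here is the same sort of Python sketch, using the per-drive rates from my original post as the stated assumptions:

# Aggregate streaming bandwidth under the stated per-drive assumptions.
exadata_mb_s = 168 * 80  # full HP Oracle Database Machine, SAS at 80 MB/s
netezza_mb_s = 112 * 70  # one Netezza cabinet, SATA at a generous 70 MB/s

print(exadata_mb_s / netezza_mb_s)  # ~1.71x
print(168 * 70)  # 11760 MB/s: even 168 SPUs at 70 MB/s trails 13440 MB/s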

29 Responses to “Oracle Exadata Storage Server. Frequently Asked Questions. Part II”


  1. Matt Zito September 26, 2008 at 9:42 pm

    As usual, a really good FAQ about this technology – it is interesting how “dumb” the cell technology is, or at least appears, given this FAQ. The storage server appears to basically be a thin RDMA shim between the disks and ASM…would you agree with that?

    Also, this Register article:

    http://www.channelregister.co.uk/2008/09/26/hp_makes_kit_for_oracle/

    Claims the storage server is powered by PolyServe – I assume this is totally incorrect, based on the last few blog posts. Oughta write the guy and let him know.

    Matt

  2. Matt Zito September 26, 2008 at 9:48 pm

    Also, is Exadata using iSER internally? Or is it a proprietary protocol?

  3. kevinclosson September 26, 2008 at 10:09 pm

    Matt,

    The wire protocol is Reliable Datagram Sockets, not iSER.

  4. kevinclosson September 26, 2008 at 10:21 pm

    That ChannelRegister article is a train wreck in more ways than one. As the former Chief Architect of Oracle Solutions at PolyServe and currently a Performance Architect on the Exadata product development team, I can assure you there is no PolyServe in Exadata…

    I sent that guy some email.

  5. kevinclosson September 26, 2008 at 11:06 pm

    “As usual, a really good FAQ about this technology – it is interesting how “dumb” the cell technology is, or at least appears, given this FAQ. ”

    …You’re joking, right? Filtration, column projection, bloom filters, fast file creation, and I/O resource management seem dumb?

    …How do you get to that conclusion from an installment in a series of FAQs? Perhaps if I answered every possible question it wouldn’t seem that way to you?

  6. Rob Johnson September 27, 2008 at 2:59 am

    I have two questions:

    (1) The storage server sends filtered results (by row and column) to the database server instead of blocks. How does the db server cache these results in its SGA? Or does it still only cache old-fashioned blocks, and not Exadata results? How is data cached in general? (OK, that was a 3-part question.)

    (2) When RAC became more popular, Oracle Corp. created a how-to guide for cheap, home-based (and unsupported) RAC systems using VMs. Will there be something like that for Exadata?

  7. Val September 27, 2008 at 3:57 pm

    1. “A. Reader, I invite you to explain to us how you think native SATA 7,200 RPM disk drives are going to match 15K RPM SAS drives. When I put 70 MB/s into the equation I was giving quite a benefit of the doubt (as if I’ve never measured SATA performance).”

    I will decline the invitation because I do not know what kind of disk drive Netezza uses (we do not own one).

    Whatever I could find on the Internet indicates that the average scan speed is about 60 MB/s per disk, which suggests that they do use 7200 RPM disks (http://www.netezzacommunity.com/blogs/nzblog/2008/01/01/issue-16-the-latest-addition-to-netezzas-fast-engines-framework)
    So, to match the Exadata sequential scan throughput, one would apparently need 2 full Netezza racks, unless they can provide a configuration with 15K disks, in which case a one-and-a-half-rack configuration might achieve the same rate.

    2. As I mentioned earlier, while the description looks quite impressive, Exadata TPC-H numbers would be truly interesting.

  8. kevinclosson September 27, 2008 at 4:19 pm

    Val, I stated matter-of-factly that they use 7,200 RPM SATA drives and, in spite of that fact, I still cooked a 70 MB/s-per-drive base for them into my comparison. So, do you still think my 70 MB/s SATA to 80 MB/s SAS is, in your words, a “bit of apples to oranges,” or was I in fact being generous to SATA? I think the latter.

    As for TPC-H, you have to remember that it is NOT a DW/BI workload. It is a third normal form schema with every trick in the book used to reduce the I/O load. In fact, there are results these days that do no I/O at all. Exadata’s job in life is to DO I/O. That is not my way of saying there will be no TPC-H results. I don’t know whether there will or won’t be. I don’t wear that hat.

  9. Val September 27, 2008 at 5:18 pm

    Kevin,

    My point with apples and oranges was that one may want to compare the same number of parallel spindles with the same performance characteristics, or at least try to equalize the configurations by adjusting the number of parallel spindles so that the “theoretical” throughput of either configuration would be as close as possible (two full racks of Netezza vs. the 168-disk Exadata configuration, or something similar). Otherwise, the comparison might look suspect.

    I honestly do not care how close the TPC-H 22 queries may or may not be to the “real world” DW workload. They are just a bunch of select/join/order by/group by/etc. statements, and if one has at least some idea of how a database goes about executing the queries (hash/merge/nested loops, FTS, and so forth), one could more or less exactly extrapolate TPC-H behaviour to “real life” query execution. It is also quite easy to see how a specific vendor product might behave in concurrent mode (the TPC-H “throughput” test).

  10. kevinclosson September 27, 2008 at 5:38 pm

    Ok Val, no blood no foul. At first glance it looked like a FUD comment.

  11. Jim September 28, 2008 at 3:49 pm

    Kevin,
    This looks like it would fit into the whole idea of the database-as-a-service approach in the older RAC white papers. Also, don’t some of the DW vendors split the data up in a shared-nothing method? Thus, when the data has to be repartitioned, it gets expensive. Whereas here you just add another cell and ASM goes to work in the background (depending upon the ASM power level you set).

  12. Geert De Paep September 28, 2008 at 6:46 pm

    Kevin,
    A question for FAQ Part III: Suppose I have a server (e.g., Linux) or a number of RAC nodes; how do I connect them to the Exadata, and how do I access the disk space? Do I need fibre or Ethernet connections, switches, or special hardware between my server and the Exadata? Do I still need OS multipath software? Do I see raw LUNs that I present to an ASM instance running on my own machine(s), or does my database communicate directly with the ASM on the Exadata? Do I still have to struggle with raw devices at the OS level? Can I create multiple databases in the available space? Do I still need to create ASM disks or diskgroups, or do I just see one large ASM disk of, e.g., 168TB? A technical picture of the connectivity and configuration would be most welcome.

  13. Gary September 29, 2008 at 2:08 am

    Kevin,
    Not sure this is exactly in your field, but what does the cell do when it hits a block that was flushed to disk before being committed (i.e., needs an UNDO check)? Can it return a mix of rowset and block data so the DB server checks UNDO?

  14. Val September 29, 2008 at 3:30 pm

    Kevin,

    Yet another minor nit 😉 As I commented elsewhere, 20Gb/s is the IB baud rate; the useful bit rate is 16Gb/s (8b/10b encoding). I am not sure why the IB folks keep using the baud numbers.

  15. kevinclosson September 29, 2008 at 4:48 pm

    Val,

    Not minor nits at all. Thanks. We have used pretty poor precision when it comes to this topic. Let me offer mitigating circumstances.

    While the pipe is indeed limited to 16Gb payload (20% less than our routinely stated 20Gb), that is still nearly twice the amount of data a cell can produce by streaming I/O from disk in the first place. So, shame on us for being 20% off, but kudos to us for making the pipes nearly 100% too big?
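
    To put numbers on that, a back-of-envelope Python sketch; the 12-drives-per-cell figure and the 80 MB/s streaming rate are my working assumptions from earlier in this series:

    # InfiniBand payload rate vs. what one cell can stream from disk.
    payload_gb_s = 20 * 0.8  # 8b/10b encoding: 20 Gb/s signaling -> 16 Gb/s payload
    cell_stream_gb_s = 12 * 80 * 8 / 1000  # 12 drives x 80 MB/s ~= 7.7 Gb/s

    print(payload_gb_s / cell_stream_gb_s)  # ~2.1: the pipe is roughly twice the disk rate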

  16. Michael Norbert September 29, 2008 at 8:01 pm

    This is exciting stuff if it can do away with all the annoyances of Teradata. You have to do a lot of extra work in Teradata to get single-keyed reads that aren’t using the primary key to perform properly. This is a requirement in our DW, as the DW is seen as a single version of the truth, so users are required to hit the DW for any type of company information. So we must really make sure our queries hit a single node in Teradata.

  17. H.Tonguç Yılmaz September 30, 2008 at 1:06 pm

    Hi Kevin,

    Is there an upgrading guide kind of document or a step-by-step installation MetaLink note planned for the Database Machine?

    As technical people, we need to briefly explain the testing and upgrading path of this new technology to our management, with its costs and risks. How do you advise we start?

    There is an online development environment for APEX at http://apex.oracle.com. Someone can request a workspace with limited space and just test the product. What do you think of a similar environment for interested Oracle customers, for test purposes, or just dedicated to TB Club members, for example?

    Thank you.

  18. Vincent October 3, 2008 at 5:53 am

    “On the topic of TPC benchmarks, I was taking a gander at the latest move in the TPC-C “arms race.” This IBM RS/6000 p595 result of 6,085,166 TpmC proves that TPC-C is not, and has never been, an I/O efficiency benchmark. If you throw more gear at it, you get a bigger number! Great!”

    Kevin, why do you think it’s not an I/O efficiency benchmark? It’s CPU and I/O efficient. Take a look at how many disks those guys installed in the system. It’s almost 11,000 disk drives! I’m pretty sure the reason for that was a huge I/O rate. They had to balance processor speed with the amount of memory and the SAN and disk configuration. You need to feed the CPUs with data or you get a lot of I/O waits, which are idle cycles. Otherwise, they would need fast CPUs and memory only, as you said. And the price/performance would be much lower.

    Vincent

  19. kevinclosson October 3, 2008 at 2:08 pm

    Vincent,

    It is a memory and concurrency benchmark. We used roughly 1/60th the amount of memory and got 1/60th the throughput. Do you think we used 1/60th (i.e., approximately 180) the disk drives?

    Yes, it requires a balanced system. That is never in doubt.

  20. Vincent October 3, 2008 at 4:20 pm

    Kevin,

    My understanding of TPC-C is that it tests the server’s performance (CPU, memory) and concurrency (which comes from the architecture design). To achieve maximum server throughput and utilization (a higher score), they need to connect a lot of disk drives. It’s a database running there 🙂

    It’s like with cars. To demonstrate what the maximum speed is, a good race track is required. Even the fastest car would not be fast anymore on a bumpy road.

    I’ve been monitoring TPC-C results for a few years, and if you look at the top 10, guys were using thousands of disk drives. The higher the score, the more disks. That’s because processors are thousands of times faster than disks. Amdahl’s law lives forever. But still, the score is related to maximum server performance. For storage, there are the SPC-1 and SPC-2 tests (and yes, folks need fast servers there as well to suck the data from/to the storage).

    I work for a business partner of both HP and IBM, and what I think is good about this benchmark is that customers may easily see:

    – how different vendors compare in terms of maximum theoretical performance,
    – how per-core performance compares,
    – how well each architecture scales up.

    No matter which OLTP system they have, the TPC-C results are still good for choosing the hardware platform. In fact, I have never seen a situation where, in real-life tests, the comparison results were the opposite of the TPC-C scores.

    So far, HP and IBM are the best there (and HP will probably jump to first place again with the quad-core Itanium 2), but IBM seems to have much better processors and architecture design (they are 50% faster with half the cores HP has!), which is why they achieve linear scalability (results from 4 to 64 cores are linear). I’m talking about UNIX servers, of course.

    I think there will come a time when Oracle/HP demonstrate TPC-H (warehouse) results on Exadata storage. People will need to know the performance (and price) difference. For now, Oracle has said it’s 10 or 30 times faster, but nobody knows faster than what. How about giving exact system configurations?

    Well, we could probably talk about it for the next month, but the key is to understand and know how to look at the benchmark results.

    Vincent

  21. kevinclosson October 3, 2008 at 4:32 pm

    Vincent,

    I was involved in competitive TPC-C with both Informix and Oracle even before Oracle could “legally” audit a TPC-C result (pre-7.3 Oracle was not compliant). I know the workload; I know the scale characteristics. I know what Oracle (internally) has to do well to get a good number. There is nothing new to learn from that workload. If you throw more memory at it, you get a bigger number.

    I think you are speed-reading my writings. I stated that the IBM 6 million TpmC result is 60x better than the best we did (10 years ago) and it used roughly 60x the memory, but not 60x the CPU count and ***certainly*** not 60x the disk. In fact, it used about 5x the disk. Let me ask you this: if IBM could build a system with 8TB RAM, what do you think the number would be? I’ll answer: it would be 12 million TpmC. It would take some uninteresting amount of additional disk and some uninteresting amount of additional CPU, but the memory-to-throughput ratio has been, and will remain, nearly constant. It is a boring arms race. That is my opinion and I’m sticking to it.

  22. Vincent October 3, 2008 at 5:28 pm

    Kevin,

    I agree 100%. They used 60x the memory and maybe not 60x as many processors, but we may assume the processors are 60x more powerful today than in those days. Was the result 100k TpmC (6M/60)?

    You say it’s all about memory. So why did they use the maximum CPU configuration? Shouldn’t they go with 32 or 16 CPUs and get an even better price/performance ratio?

    BTW, I did some quick and dirty calculations:

    1. IBM p595 – 4TB RAM = 6M TpmC = 666kB RAM/TpmC
    2. Superdome – 4TB RAM = 6M TpmC = 666kB RAM/TpmC
    3. IBM p570 – 768GB RAM = 1.6M TpmC = 480kB RAM/TpmC
    4. IBM x3850 – 256GB RAM = 684k TpmC = 374kB RAM/TpmC

    What do you think? It’s getting inconsistent.

    Vincent

  23. Vincent October 3, 2008 at 5:48 pm

    Kevin,

    my mistake,

    2. Superdome – 4TB RAM = 4M TpmC = 1000kB RAM/TpmC

    so it’s about 1MB of RAM for a single TpmC. It’s getting even worse 😦

    Vincent

  24. kevinclosson October 3, 2008 at 6:01 pm

    Vincent,

    Please, this thread on TPC-C is a sidebar to Exadata and is boring. No offense, but really, this is an old, tired workload. Did I say anywhere that there is a 60x formula cast in stone? No, I said that it is no surprise that 60x the RAM yielded 60x the performance (±4%) over a 10-year-old result. A ten-year-old result.

    So, yes, you’ve found 2 results that don’t fit the magic-yet-nonexistent 60x RAM-to-TpmC ratio. But you did skip over the more recent IBM p595 result of 4,033,378 TpmC from 2TB of memory (~540KB/TpmC). The point is that all of these results, from 480KB/TpmC to 666KB/TpmC, are within 25% of the ratio used ***10 years ago***. Considering how much everything else has changed (e.g., the orders-of-magnitude increase in processor capability), I’d call this “near constant.” But then, that sounds like a “near miss” or a “complete stop.”

    Let’s drop it.

  25. Eric October 4, 2008 at 7:08 pm

    I’m not sure the TpmC/memory ratio is meaningful, as some have noted; however, the TpmC/dollar is somewhat meaningful, especially if one is using this number to make a purchasing decision today. This ultimately is how Oracle sales will be able to push this thing to market and into your shops.

  26. kevinclosson October 5, 2008 at 1:40 am

    $$/TpmC. Is that a new concept?


