Archive for the 'oracle' Category



Oracle Exadata Storage Server FAQ – Part IV. A Peek at Storage Filtration and Column Projection.

This is installment number four in my series on Oracle Exadata Storage Server and HP Oracle Database Machine frequently asked questions. I recommend you also visit:

Oracle Exadata Storage Server Frequently Asked Questions Part I.

Oracle Exadata Storage Server Frequently Asked Questions Part II.

Oracle Exadata Storage Server. Frequently Asked Questions. Part III.

I’m mostly cutting and pasting questions from the comment threads of my blog posts about Exadata, and mixing in some assertions I’ve seen on the web re-phrased as questions. If they already read as questions when I find them, I cut and paste them without modification.

Q. What is meant in the Exadata white paper about Smart Scan Predicate Filtering by “…only rows where the employees hire date is after the specified date are sent from Exadata to the database instance..”? Does it really only return the rows matching the predicate or does it return all blocks containing rows which match the predicate? If the former is correct, how is this handled in the db block buffer?

A. Actually, that statement is a bit too liberal. Oracle Exadata Storage Server Smart Scans return only the cited columns from only the filtered rows, not entire rows as that statement may suggest. Results from a Smart Scan are not “real” database blocks thus the results do not get cached in the SGA. Consider the following business question:
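To make the distinction concrete, here is a minimal Python sketch (purely illustrative, not Oracle code; the block layout and column names are invented) contrasting a conventional block server, which ships whole blocks for the database host to filter, with a Smart Scan, which ships only the cited columns of only the filtered rows:

```python
# Illustrative model of why a Smart Scan ships fewer bytes.
# A "block" is a list of whole rows; a row is a dict of column values.
blocks = [
    [{"custid": 1, "hire_dt": 2006, "addr": "x" * 50},
     {"custid": 2, "hire_dt": 2009, "addr": "y" * 50}],
    [{"custid": 3, "hire_dt": 2010, "addr": "z" * 50}],
]

def conventional_scan(blocks):
    # Conventional storage returns every block; filtering and column
    # projection burn cycles on the database host.
    return [row for block in blocks for row in block]

def smart_scan(blocks, predicate, columns):
    # Exadata-style storage applies the predicate and projects the
    # columns in the cell; only slim result rows reach the instance.
    return [{c: row[c] for c in columns}
            for block in blocks for row in block if predicate(row)]

full = conventional_scan(blocks)
slim = smart_scan(blocks, lambda r: r["hire_dt"] > 2008, ["custid"])
print(len(full), slim)  # 3 [{'custid': 2}, {'custid': 3}]
```

Note too that the slim result rows are not database blocks, which is why they are not candidates for the SGA buffer cache.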

List our Club Card members that have spent more than $1,000.00 at non-partner retail merchants in the last 180 days with our affinity credit card. Consider only non-partner merchants within a 10-mile radius of one of our stores.

The SQL statement in the following text box answers this question (assuming 10-mile wide US postal code zones):

select cf.custid, sum(act.purchase_amt) sales
from all_card_trans act, cust_fact cf
where ( act.card_no like '4777%' or act.card_no like '3333%' )
and act.card_no = cf.aff_cc_num
and cf.club_card_num not like '0%'
and act.purchase_dt > to_date('07-MAR-2008','dd-mon-yyyy')
and act.merchant_zip in ( select distinct zip from our_stores )
and act.merchant_code not in ( select merchant_code from partner_merchants )
group by cf.custid
having sum(act.purchase_amt) > 1000 ;

Intelligent Storage. Filtering and Projecting. Oh, and a Bloom Filter Too.

Now, let’s look at the predicate handling in the plan. The next text box shows that storage is filtering on club card numbers, purchase dates and credit card numbers. Given the way this query functions, all_card_trans is essentially a fact table and cust_fact is more of a dimension table, since there are 1,300-fold more rows in all_card_trans than in cust_fact. You’ll also see there was a Bloom filter pushed to the Exadata Storage Server cells, used to join the filtered cust_fact rows to the filtered all_card_trans rows.

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter(SUM("ACT"."PURCHASE_AMT")>1000)
   7 - access("ACT"."MERCHANT_CODE"="MERCHANT_CODE")
  10 - access("ACT"."MERCHANT_ZIP"="ZIP")
  15 - access("ACT"."CARD_NO"="CF"."AFF_CC_NUM")
  20 - storage("CF"."CLUB_CARD_NUM" NOT LIKE '0%')
       filter("CF"."CLUB_CARD_NUM" NOT LIKE '0%')
  25 - storage("ACT"."PURCHASE_DT">TO_DATE(' 2008-03-07 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
       ("ACT"."CARD_NO" LIKE '4777%' OR "ACT"."CARD_NO" LIKE '3333%'))
       filter("ACT"."PURCHASE_DT">TO_DATE(' 2008-03-07 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
       ("ACT"."CARD_NO" LIKE '4777%' OR "ACT"."CARD_NO" LIKE '3333%') AND
       SYS_OP_BLOOM_FILTER(:BF0000,"ACT"."CARD_NO"))
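The SYS_OP_BLOOM_FILTER reference deserves a word. A Bloom filter is a compact, probabilistic membership test: it can answer “definitely not present” or “possibly present,” but never gives a false negative. The database builds one from the filtered cust_fact join keys and ships it to the cells, so fact rows that cannot possibly join are discarded down in storage. A toy Python sketch of the idea (the bit-array size and hash scheme here are arbitrary illustrations, not Oracle’s implementation):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: bit positions come from salted SHA-256 hashes."""
    def __init__(self, nbits=1024, nhashes=3):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = 0

    def _positions(self, key):
        for salt in range(self.nhashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.nbits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely cannot join"; True means "maybe".
        return all(self.bits >> pos & 1 for pos in self._positions(key))

# Build the filter from the (already filtered) dimension join keys...
bf = BloomFilter()
for card_no in ("4777000011112222", "3333000011112222"):
    bf.add(card_no)

# ...then prune fact rows in "storage" before they travel anywhere.
fact_card_nos = ["4777000011112222", "9999888877776666"]
survivors = [c for c in fact_card_nos if bf.might_contain(c)]
```

A key that was added always survives; a non-joining key is almost always pruned, and a rare false positive costs only a wasted join probe on the database host, never a wrong result.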

This example should make the answer a bit clearer. Storage is ripping through the rows, yes, but only returning named columns and heavily filtered data.

The following text box shows frozen output of a tool we use to monitor I/O at a high level across cells. The Exadata Storage cells in this case are named sgbeta1s01 through sgbeta1s06 (I’m like you, I have to scrounge for hardware). These are two 3-second averages showing I/O in excess of 1,000,000 KB/s per cell, or an aggregate of roughly 6 GB/s (see the data under the “bi” column). This is, of course, what we’ve been referring to as a “half-rack,” in spite of the fact that it has only 6 Exadata Storage Server cells instead of the 7 that would be a true half-rack. Nonetheless, this is a 4-table query doing real work and, as you can see in the last text box, I’m connected to a single instance of RAC on a dual-socket/quad-core server. Without Exadata, a server would need some 15 Fibre Channel host bus adapters, and a great deal of CPU power, to drive this query at this I/O rate. That would be a very large server, and it would still be slower.

Kids, don’t try this at home without supervision

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
15:29:43:   r  b     swpd    free   buff     cache   si   so        bi    bo    in      cs us sy id wa st
sgbeta1s01: 4    0    204   63292 134928 1523732    0      0 1033237    24  5062 37992 25  3 72  0  0
sgbeta1s02: 8    0    204   65708 141800 1596124    0      0 1038949     9  5238 37869 25  3 72  0  0
sgbeta1s03: 4    0    208 1183780  57832    584988    0    0 1049712    35  5862 40258 26  3 69    2  0
sgbeta1s04: 2    0    204 1254456  79084    568052    0    0 1053093    13  5329 38462 25  3 72    0  0
sgbeta1s05: 4    0      0  144924 122252 1453324    0      0 1022032    29  4991 38164 25  3 72  0  0
sgbeta1s06:11    0      0   95808 124200 1418236    0      0 1000635    24  4767 38249 25  2 73  0  0
Minimum:   2  0      0   63292  57832    568052    0    0 1000635     9  4767 37869 25  2 69    0  0
Maximum:11  0    208 1254456 141800 1596124    0      0 1053093    35  5862 40258 26  3 73  2  0
Average:   5  0    136  467994 110016   1190742    0    0 1032943    22  5208 38499 25  2 71  0  0
procs -----------memory---------- ---swap-- -----io----  --system-- -----cpu------
15:29:46:   r  b     swpd    free   buff     cache   si   so        bi    bo    in      cs us sy id wa st
sgbeta1s01: 3    0    204   63012 134928 1523732    0      0 1005643    24  5099 38475 25  2 72  0  0
sgbeta1s02: 2    0    204   65188 141800 1596124    0      0 1011285     3  4867 37671 25  2 73  0  0
sgbeta1s03: 4    0    208 1184408  57832    584988    0    0 1005371    20  5775 39378 25  3 72    0  0
sgbeta1s04: 3    0    204 1255200  79084    568052    0    0    993264    13  4891 38240 24  3 73  0  0
sgbeta1s05: 5    0      0  144320 122256 1453324    0      0 1016859    29  4947 38073 25  3 72  0  0
sgbeta1s06: 6    0      0   95824 124200 1418236    0    0 1019323    24  4961 38224 25  2 73  0  0
Minimum:   2  0      0   63012  57832    568052    0    0    993264     3  4867 37671 24  2 72  0  0
Maximum:   6  0    208 1255200 141800 1596124      0    0 1019323    29    5775 39378 25  3 73  0    0
Average:   3  0    136  467992 110016   1190742    0    0 1008624    18  5090 38343 24  2 72  0  0
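For anyone checking the arithmetic: vmstat’s “bi” column is KB read in per second, per cell, and six cells are reporting. A quick sketch using the first interval’s per-cell average from the output above:

```python
# "bi" is KB/s per cell in the vmstat output above; six cells report.
avg_bi_kb_per_s = 1_032_943
cells = 6
aggregate_gb_per_s = avg_bi_kb_per_s * cells / 1024 / 1024
print(round(aggregate_gb_per_s, 1))  # 5.9 -- roughly 6 GB/s aggregate
```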
[oracle@sgbeta1c01]$ sqlplus '/ as sysdba'

SQL*Plus: Release 11.1.0.7.0 - Production on Thu Oct 2 15:35:55 2008
Copyright (c) 1982, 2008, Oracle.  All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release   11.1.0.7.0 - 64bit Production
With the Partitioning, Real Application Clusters,   OLAP, Data Mining
and Real Application Testing options
SQL> select instance_name from gv$instance ;
INSTANCE_NAME
----------------
test1
SQL>

Oracle Exadata Storage Server. Frequently Asked Questions. Part III.

This is installment number three in my series on Oracle Exadata Storage Server and HP Oracle Database Machine frequently asked questions. I recommend you also visit:

Exadata Storage Server Frequently Asked Questions Part I.

Exadata Storage Server Frequently Asked Questions Part II.

I’m mostly cutting and pasting questions from the comment threads of my blog posts about Exadata, and mixing in some assertions I’ve seen on the web re-phrased as questions. If they already read as questions when I find them, I cut and paste them without modification.

Q. Is there a coming Upgrading Guide kind of document or a step-by-step installation metalink note planned for the database machine?
A. HP Oracle Database Machine and Oracle Exadata Storage Servers are installed at the factory by HP.

Q. The ODM spec sheet says there are four 24-port InfiniBand switches (96 total ports) in each DB machine. If each of the 8 RDBMS hosts has 2 links to the switches and each of the 14 Exadata servers also has two, then that is just 2 × 8 + 2 × 14 = 44 links.
A. Since the HP Oracle Database Machine is an appliance installed at the factory, by HP, I’m hesitant to go too deep in this area. The short answer is that the extra switches are there to address the loss of an entire HP Oracle Database Machine rack in a multi-rack scale-out configuration.

Q. How does this architecture deal with data distribution and redistribution? It seems like that’s still going to be a problem with joining data that isn’t distributed the same way. Does all the data then go back to the RAC?
A. Data distribution is a multifaceted topic. There is partition-wise data distribution and ASM extent distribution. Nonetheless, the answer is the same for both types of distribution: no change. ASM treats what we refer to as “grid disks” in Exadata Storage Cells no differently than disks in a SAN when it comes to laying out extents. Likewise, partitioning does not change. In fact, nothing about partitioning changes with Exadata.

If, for instance, you have data with poor distribution (e.g., partitioning skew) with ASM on a SAN, it would be the same with Exadata, but at least the I/O would be extremely fast <smiley>

Exadata changes how data comes from disk, not how it is placed on disk.

Q. If I do a big query and sort, will that bottleneck one of the RAC nodes?
A. Exadata changes nothing about Oracle in this regard. Nonetheless, sorts are parallelized with intra-node Parallel Query in a Real Application Clusters environment, so I’m at a loss as to what you are referring to.

Q. Is temp space managed at the storage layer or on the RAC nodes?
A. Exadata changes nothing about Oracle in this regard. Temporary segments are a logical collection of extents in a file. It’s the same with or without Exadata.

Q. Not sure this is exactly in your field, but what does the cell do when it hits a block that was flushed to disk before being committed (ie needs an UNDO check)? Can it return a mix of rowset and block data so the DB server checks UNDO ?
A. Data consistency and transaction concurrency are not offloaded to Exadata Storage Servers. The integrity of the scan is controlled by the RDBMS. I think it is counterproductive to discuss the edge-cases where a Smart Scan will not be possible. If you are using a database as a data warehouse, you will get Smart Scans. If you are doing reporting against an active OLTP database, you will see queries that are not serviced by Smart Scans.

Smart Scans are optimized for data warehousing workloads and, just as is the case in non-Exadata environments, it is not good practice to be modifying the active query data set while running queries. Adding partitions and loading data while queries are running, sure, but changing a row here and there in a data warehouse doesn’t make much sense (at least to me).

Q. Suppose I have a server (e.g. linux) or a number of RAC nodes, how do I connect them to the Exadata and how do I access the disk space?
A. If you wish to adopt Exadata into an existing Oracle Database 11.1.0.7 environment there are SKUs for that. Talk to your sales representative and make room for Infiniband switches and HCAs.

Q. Do I need fibre or ethernet connections, switches, special hardware between my server and the Exadata?
A. Of course! Exadata is Infiniband based. You’ll at least need Infiniband HCAs and switches to get to the data stored in Exadata. Once you are up to the correct Oracle version you can run with non-Exadata and Exadata tablespaces side-by-side. This fact will aid migrations.

Q. Do I still need OS multipath software?
A. No. Well, not for Exadata.

Q. Do I see raw LUNs that I present to an ASM instance running on my own machine(s), or does my database communicate directly with the ASM on the Exadata?
A. ASM will have visibility to Exadata Storage Server “grid disks.” There happens to be a command line tool that makes it easy for me to illustrate the point. In the following text box I’ve cut and pasted session output from an xterm where I used the kfod tool to list all known Exadata Storage Server grid disks and grep’ed for ones I named “data1” on cell number 6 of the configuration. To further illustrate the point I then changed directories to list the DDL I used to incorporate all “data1” grid disks in the configuration into an ASM disk group called “DATA1.” Other than the fact that DATA1 is a candidate for Smart Scan, there is really nothing different between this disk group and any other Oracle Database 11g ASM disk group.

$ kfod -disk all | grep 'data1.*cell06'

241: 117760 Mb o/192.168.50.32:5042/data1_CD_10_cell06 <unknown> <unknown>

242: 117760 Mb o/192.168.50.32:5042/data1_CD_11_cell06 <unknown> <unknown>

243: 117760 Mb o/192.168.50.32:5042/data1_CD_12_cell06 <unknown> <unknown>

244: 117760 Mb o/192.168.50.32:5042/data1_CD_1_cell06 <unknown> <unknown>

245: 117760 Mb o/192.168.50.32:5042/data1_CD_2_cell06 <unknown> <unknown>

246: 117760 Mb o/192.168.50.32:5042/data1_CD_3_cell06 <unknown> <unknown>

247: 117760 Mb o/192.168.50.32:5042/data1_CD_4_cell06 <unknown> <unknown>

248: 117760 Mb o/192.168.50.32:5042/data1_CD_5_cell06 <unknown> <unknown>

249: 117760 Mb o/192.168.50.32:5042/data1_CD_6_cell06 <unknown> <unknown>

250: 117760 Mb o/192.168.50.32:5042/data1_CD_7_cell06 <unknown> <unknown>

251: 117760 Mb o/192.168.50.32:5042/data1_CD_8_cell06 <unknown> <unknown>

252: 117760 Mb o/192.168.50.32:5042/data1_CD_9_cell06 <unknown> <unknown>

$ cd $ORACLE_HOME/dbs

$ cat cr_data1_dg.sql
create diskgroup DATA1 normal redundancy
DISK 'o/*/*data1*'
ATTRIBUTE
'AU_SIZE' = '4M',
'CELL.SMART_SCAN_CAPABLE'='TRUE',
'compatible.rdbms'='11.1.0.7',
'compatible.asm'='11.1.0.7'
/

Q. Do I still have to struggle with raw devices on os level?
A. No.

Q. Can I create multiple databases in the available space?
A. Absolutely. I haven’t even started blogging about the I/O Resource Management features of Exadata. This is the only platform that can prevent multiple applications from stealing resources from each other, all the way down to physical I/O.

Q. Do I still need to create asm disks or diskgroups, or do I just see one large asm disk of e.g. 168Tb?
A. Physical disks in Exadata Storage Server cells are “carved” up into what we refer to as grid disks. Each grid disk becomes an ASM disk. The fewest ASM disks you could end up with in a full-rack HP Oracle Database Machine is 168.

Q. […] don’t some of the DW vendors split the data up in a shared nothing method. Thus when the data has to be repartitioned it gets expensive. Whereas here you just add another cell and ASM goes to work in the background. (depending upon the ASM power level you set.)
A. All the DW Appliance vendors implement shared-nothing so, yes, the data is chopped up into physical partitions. If you add hardware to increase performance of queries against your current dataset the data will have to be reloaded into the new partitioning scheme. As has always been the case with ASM, adding new disks-and therefore Exadata Storage Server cells-will cause the existing data to be redistributed automatically over all (including the new) drives. This ASM data redistribution is an online function.

Q. [regarding] Supportability – Oracle software support has always been spotty. Now with a combination of Oracle Linux, Oracle database and HP hardware, it is going to be interesting to see how it all comes together – especially upgrades, patches etc.
A. Support is provided via a single phone number.

Q. How easy or difficult is it to maintain? Do we need to build specialized skills inhouse or is it hands-off like Teradata?
A. In my reckoning, you need the same Oracle data warehousing skills you need today, plus a primer on Exadata.

Q. [regarding] Ease of use – Can I simply move an existing oracle warehouse instance to the new database machine and can use it day 1? How easy or difficult is it? Do I need to spend significant time like with a RAC instance – partitioning etc?
A. Data from an existing data warehouse will have to be physically moved into an Exadata environment. You will be either moving entirely from one environment to another (e.g., 10g on Unix to Exadata with Linux) or adding Exadata storage to your existing environment and copying the data into it. The former would be done in the same manner as any cross-platform migration, while the latter would require the warehouse be upgraded to Oracle Database 11g Release 11.1.0.7. Once the upgrade to 11.1.0.7 is done and Infiniband connectivity is sorted out, the data can then be copied with the simplicity of a CTAS operation or other such method.

Q. The Exadata storage concept is excellent – more storage comes with additional CPU and cache. Can we use it for non-Oracle applications, such as log processing?
A. Anything that can go into an Oracle Database can go into Exadata, so features such as SecureFiles are supported. Exadata is not, however, scalable general-purpose storage.

Q. Why would I want to use Oracle rather than Teradata or Netezza which is proven?
A. Because, perhaps, the data you are extracting to load into Netezza is coming from an Oracle Database? There are a lot of answers to this question I suppose. In the end, I should think the choice would be based foremost on performance. Most of Netezza’s customers are either Oracle customers as well, or have migrated from Oracle. I think in Netezza’s “early days” the question was likely reversed. We aim to reverse the question.

Q. Backup using RMAN – RMAN backups are not really geared for big databases, so is there any other off host alternatives available?
A. The data stored in Exadata is under ASM control. The same backup restrictions apply for Exadata as any other ASM deployment.

Oracle Exadata Storage Server Related Web News Media and Blog Errata. Part I.

This is just a quick blog entry to correct some minor (and some not-so-minor) errors I’ve stumbled upon in blog and Web news media.

Exadata Storage Server Gross Capacity

In his recent Computerworld article, Eric Lai was trying to put some flesh on the bones as it were. It is a good article, but a few bits need correction/clarification. First, Eric wrote:

The Exadata Storage Server, a standard rack-mountable HP Proliant DL180 G5 server sporting two Intel quad-core CPUs connected to 12 hard drives of 1TB each.

The HP Oracle Database Machine has two hard drive options: SAS and SATA. The SAS option consists of 12 300GB 15,000 RPM drives. The SATA option is indeed 12 drives, each 1TB in size.

HP Oracle Database Machine Communications Bandwidth

Later in the article the topic of bandwidth between the RDBMS hosts and Exadata Storage Servers was covered. The author wrote:

[…] users can expect a real-world bandwidth today of 1Gbit/sec, which he claimed is far faster than conventional disk storage arrays.

The HP Oracle Database Machine has one active and one failover Infiniband path between each of the RDBMS hosts (for RAC inter-process communication) and from each RDBMS host to each Exadata Storage Server. Each path offers 20Gb/s of bandwidth, which is more than the aggregate streaming disk I/O an Exadata cell can produce for up-stream delivery, as I explain in question #2 of my Exadata Storage Server FAQ. Since the disks can “only” generate roughly 1GB/s of streaming data, we routinely state that users can expect real-world bandwidth today of 1GB/s. Note the difference in notation (GB vs. Gb); it accounts for nearly an order of magnitude of difference.

A Rack Half-Full or Half-Empty?

When discussing the Beta testing activity, the author quoted Larry Ellison as having said the following at his Keynote address:

A number of Oracle customers have been testing the Machine for a year, putting their actual production workloads onto half-sized Oracle Database Machines, because we’re really cheap

I was there and he did say it. While I may be dumb (actually I can talk), I am not stupid enough to “correct” Larry Ellison. Nonetheless, when Larry said the Beta participants were delivered a half-sized HP Oracle Database Machine he was actually being too generous. He said we sent a half configuration because “we’re really cheap”, but in fact we must be even cheaper because, while we sent them half the number of RDBMS hosts, we sent them 6 Exadata Storage Servers as opposed to 7, which would be exactly half a Database Machine.

Good for the Goose, Good for the Gander

Finally, on my very own blog (in this post even!) I have routinely stated the wire bandwidth of the Infiniband network with which we interconnect Oracle Database instances and Oracle Database instances to Oracle Exadata Storage Server cells as being 20Gb/s. Of course with all communications protocols there is a difference between the wire-rate and the payload bandwidth. One of my blog readers commented as follows:

Yet another minor nit ;) As I commented elsewhere, 20Gb/s is the IB baud rate, the useful bit rate is 16Gb/s (8b/10b encoding). I am not sure why the IB folks keep using the baud numbers.

And to that I replied:

Not minor nits at all. Thanks. We have used pretty poor precision when it comes to this topic. Let me offer mitigating circumstances.

While the pipe is indeed limited to 16Gb payload (20% less than our routinely stated 20Gb), that is still nearly twice the amount of data a cell can produce by streaming I/O from disk in the first place. So, shame on us for being 20% off in that regard, but kudos to us for making the pipes nearly 100% too big?
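The arithmetic behind that exchange, sketched in Python for anyone who wants to follow along (the 8b/10b factor means every 10 bits on the wire carry 8 bits of payload):

```python
wire_gbit = 20                     # routinely quoted Infiniband signaling rate
payload_gbit = wire_gbit * 8 / 10  # 8b/10b encoding leaves 16 Gb/s of payload
payload_gbyte = payload_gbit / 8   # 16 Gb/s is 2 GB/s
cell_stream_gbyte = 1              # roughly what one cell's disks can stream
print(payload_gbyte, payload_gbyte / cell_stream_gbyte)  # 2.0 2.0
```

Two GB/s of usable link bandwidth against roughly one GB/s of disk streaming per cell is the “nearly 100% too big” claim above.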

HP Oracle Database Machine. A Thing of Beauty Capable of “Real Throughput!”

As they say, a blog without photographs is simply boring. Here is a picture of a single-rack HP Oracle Database Machine. It is stuffed with 8 nodes for Real Application Clusters and 14 Oracle Exadata Storage Servers with 168 3.5″ SAS hard drives. My lab work on a SAS version of one just like this yields 13.6 GB/s of throughput for table scans with offloaded filtration and column projection.
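As a sanity check, 13.6 GB/s spread over 168 drives is consistent with the roughly 80 MB/s per-SAS-drive streaming rate I use elsewhere in this series:

```python
scan_gb_per_s = 13.6
drives = 168
per_drive_mb_per_s = scan_gb_per_s * 1024 / drives
print(round(per_drive_mb_per_s))  # 83 -- about 80 MB/s per SAS drive
```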

The next photo is a shot of (from the left) Mike Hallas, Greg Rahn and myself in the Moscone North Demo of the HP Oracle Database Machine. Mike and Greg are in the Oracle Real-World Performance Group. Great guys!

Real Throughput or Effective Throughput?

Mike worked (jointly with Bob Carlin) on the latest scale-out Proof of Concept, which drove a multi-rack HP Oracle Database Machine to 70 GB/s scanning tables with 4.28:1 compression, or, in terms used more commonly by The Competition™, 299.6 GB/s. Of course, 300 GB/s is the effective scan rate, but be aware that The Competition™ oftentimes expresses throughput as effective throughput. I don’t play that game. I’ll say whether it is throughput or effective throughput. Wordy, I know, but I’m not as short-winded as The Competition™, it seems.
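The “effective” figure is nothing more than the physical scan rate multiplied by the compression ratio:

```python
physical_gb_per_s = 70     # actual bytes read from disk
compression_ratio = 4.28   # 4.28:1 table compression
effective_gb_per_s = physical_gb_per_s * compression_ratio
print(round(effective_gb_per_s, 1))  # 299.6 -- the "effective" scan rate
```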

I’ve blogged about the Real-World Performance Group (under the esteemed Andrew Holdsworth) before. Those guys are awesome! Come to think of it, I have to bestow the “A” word on the MAA team as well in spite of the fact that Mike Nowak was “too busy” to catch a beer with me during the entire OW week. That’s weak! 🙂

Oracle Exadata Storage Server. Frequently Asked Questions. Part II

This is installment number two in my series on Oracle Exadata Storage Server and HP Oracle Database Machine frequently asked questions. I recommend you also visit Exadata Storage Server Frequently Asked Questions Part I. I’m mostly cutting and pasting questions from the comment threads of my blog posts about Exadata and mixing in some assertions I’ve seen on the web and re-wording them as questions.

Later today The Pythian Group will be conducting a podcast question and answer interview with me that will be available on their website shortly thereafter. I’ll post a link to that when it is available.

Questions and Answers

Q. [I’m] willing to bet this is a full-blown Oracle instance running on each Exabyte [sic] Storage Server.

A. No, bad bet. Exadata Storage Server Software is not an Oracle Database instance. I happened to have an xterm with a shell process sitting in a directory with the self-extracting binary distribution file in it. We can tell by the size of the file that there is no room for a full-blown Oracle Database distribution:

$ ls -l cell*

-rwxr-xr-x 1 root root 206411729 Sep 12 22:04 cell-080905-1-rpm.bin

Q. This must certainly be a difficult product to install, right?

A. HP installs the software on their manufacturing floor. Nonetheless I’ll point out that installing Oracle Exadata Storage Server Software is a single execution of the binary distribution file without options or arguments. Further, initializing a Cell is a single command with two options of which only one requires an argument; such as the following example where I specify a bonded Infiniband interface for interconnect 1:

# cellcli

CellCLI: Release 11.1.3.0.0 – Production on Fri Sep 26 10:56:17 PDT 2008

Copyright (c) 2007, 2008, Oracle. All rights reserved.

Cell Efficiency Ratio: 10,956.8

CellCLI> create cell cell01 interconnect1=bond0

After this command completes I’ve got a valid cell. There are no preparatory commands (e.g., disk partitioning, volume management, etc).

Q. I’m trying to grasp whether this is really just being pitched at the BI and data warehouse space, or whether it has real value in the OLTP space as well.

A. Oracle Exadata Storage Server is the best block server for Oracle Database, bar none. That being said, in the current release, Exadata Storage Server is in fact optimized for DW/BI workloads, not OLTP.

Q. I know we shouldn’t set too much store in these things, but are there plans to submit TPC benchmarks?

A. You are right that there should not be as much stock placed in TPC benchmarks, but they are a necessary evil. I don’t work in that space, but could you imagine Oracle not doing some audited benchmarks? Seems unlikely to me.

On the topic of TPC benchmarks, I was taking a gander at the latest move in the TPC-C “arms race.” This IBM Power 595 result of 6,085,166 TpmC proves that the TPC-C is not (and has never been) an I/O efficiency benchmark. If you throw more gear at it, you get a bigger number. Great!

How about a stroll down memory lane.

When I was in Database Engineering in Sequent Computer Systems back in 1998, Sequent published a world-record Oracle TPC-C result on our NUMA system. We achieved 93,901 TpmC using 64GB main memory. The 6,085,166 IBM number I just cited used 4TB main memory. So how fulfilling do you think it must be to do that much work on a TPC-C just to prove that in 10 years nothing has changed! The Sequent result comes in at 1 TpmC per 714KB main memory and the IBM result at 1 TpmC per 705KB main memory. Now that’s what I want to do for a living! Build a system with 10,992 disk drives and tons of other kit just to beat a 10-year-old result by 1.3%. Yes, we are now totally convinced that if you throw more memory at the workload you get a bigger number! In the words of Gomer Pyle, “Soo-prise, Soo-prise, Soo-prise.” Ok, enough of that, I don’t like arms-race benchmarks.
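The per-TpmC memory figures are easy to reproduce:

```python
# TpmC per KB of main memory: 1998 Sequent result vs. 2008 IBM result.
sequent_tpmc, sequent_mem_kb = 93_901, 64 * 1024 * 1024       # 64 GB
ibm_tpmc, ibm_mem_kb = 6_085_166, 4 * 1024 * 1024 * 1024      # 4 TB
print(sequent_mem_kb // sequent_tpmc, ibm_mem_kb // ibm_tpmc) # 714 705
```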

Q. From the Oracle Exadata white paper: “No cell-to-cell communication is ever done or required in an Exadata configuration.” And a few paragraphs later: “Data is mirrored across cells to ensure that the failure of a cell will not cause loss of data, or inhibit data accessibility.” Can both these statements be true, and would we need to purchase a minimum of two cells for a small-ish ASM environment?

A. Cells are entirely autonomous and the two statements are indeed both true. Consider two ASM disks out in a Fibre Channel SAN. Of course we know those two disks are not “aware of each other” just because ASM is using blocks from each to perform mirroring. The same is true for Oracle Exadata Storage Server cells and the drives housed inside them. As for the second part of the question, yes, you must have a minimum of two cells. In spite of the fact that cells are shared-nothing (unaware of each other), ASM is in fact cell-aware. ASM is intelligent enough not to mirror between two drives in the same cell.
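A minimal sketch of that cell-aware placement idea (illustrative only; the cell and disk names are invented and ASM’s real extent-placement algorithm is far more involved):

```python
# Each ASM disk is tagged with the cell (failure domain) it lives in.
disks = [("cell01", n) for n in range(12)] + [("cell02", n) for n in range(12)]

def pick_mirror(primary, candidates):
    # A mirror copy must land on a disk in a *different* cell, so the
    # loss of one autonomous cell never takes out both copies.
    return next(d for d in candidates if d[0] != primary[0])

primary = ("cell01", 3)
mirror = pick_mirror(primary, disks)
print(mirror[0])  # cell02
```

This is also why two cells is the floor for normal redundancy: with one cell there is no other failure domain to mirror into.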


Q. Can this secret sauce help with write speeds?

A. That depends. If you have a workload suffering from the loss of processor cycles associated with standard Unix/Linux I/O libraries then, sure. If you have an application that uses storage provisioned from an overburdened back-end Fibre Channel disk loop (due to application collusion) then, sure. Strictly speaking, the “secret sauce” is the Oracle Exadata Storage Server Software, and it does not have any features for write acceleration. Any benefit would have to come from the fact that the I/O pipes are ridiculously fast, the I/O protocol is ridiculously lightweight, and the system as a whole is naturally balanced. I’ll blog about the I/O Resource Management (IORM) feature of Exadata soon, as I feel it has positive attributes that will help OLTP applications. Although it is not an acceleration feature, it eliminates situations where applications steal storage bandwidth from each other.

Q. I like your initial overview of the product, but I believe that you need to compare both Netezza and Exadata side by side in real-world scenarios to gauge their performance.

A. I partially agree. I cannot go and buy a Netezza and legally produce competitive benchmark results based on the gear. Just read any EULA for any storage management software and you’ll see the bold print. Now that doesn’t mean Oracle’s competitors don’t do that. I think the comparison will come in the form of reduced Netezza sales. Heaven knows the 16% drop in Netezza stock was not as brutal as I expected.

Q. Re. [your] comparison to Netezza [in your first Exadata-related post]. It’s a bit of apples to oranges, really. You assume 80MB/s per disk for Exadata and for some reason only 70MB/s per disk for Netezza. Also, you have 168 disks spinning in parallel on Exadata and 112 on Netezza. Had your assumptions been the same, sequential I/O throughput would be similar, at least theoretically.

A. Reader, I invite you to explain to us how native 7,200 RPM SATA disk drives are going to match 15,000 RPM SAS drives. When I put 70 MB/s into the equation I was giving quite a benefit of the doubt (as if I’ve never measured SATA performance). Please, if you have a Netezza, let us know how much streaming I/O you get from a 7,200 RPM SATA drive once you read beyond the first few outside sectors. I have also been using the more conservative 80 MB/s for our SAS drives. I’m high-balling SATA and low-balling SAS; that sounds fair to me. As for the comparison between the numbers of drives, well, Netezza packaging limits the drive (SPU) count to 112 per cabinet. It would suit me fine if it takes a full rack plus another half rack to match a single HP Oracle Database Machine. That empty half of the rack would be annoying from a space-constraint point of view, though. Nonetheless, if you did go with a rack and a half (168 SPUs), would that somehow cancel out the base difference in drive performance between SATA and SAS?
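For what it’s worth, plugging the reader’s own drive counts into the very per-drive rates being disputed shows why drive count alone doesn’t close the gap:

```python
netezza_mb_per_s = 112 * 70   # 112 SATA SPUs at a generous 70 MB/s each
exadata_mb_per_s = 168 * 80   # 168 SAS drives at a conservative 80 MB/s each
print(netezza_mb_per_s, exadata_mb_per_s)  # 7840 13440
```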

Oracle Exadata Storage Server. Part II.

I have to run over and man a live single-rack HP Oracle Database Machine Demonstration in Moscone North, so I thought I’d take just a moment to post some links to more official Oracle information on Oracle Exadata Storage Server:

Main Oracle Exadata Storage Server Webpage

Oracle Exadata Storage Server Product Whitepaper

I plan to start a FAQ-style series of blog posts regarding the HP Oracle Database Machine and Oracle Exadata Storage Server as soon as possible.

Yesterday was a big day for me, and the extremely talented team I work with. Having the pleasure of doing performance architecture work on Oracle’s most important new product in ages (my humble opinion) has been quite an adventure. One of the last Oracle Exadata Storage Server tasks I worked on prior to the release was a Proof of Concept for Winter Corporation. Expect the report from that work to be available in the next few days. I’ll post a link to that when it is ready.

Oracle Exadata Storage Server. Part I.

Brute Force with Brains.
Here is a brief overview of the Oracle Exadata Storage Server key performance attributes:

  • Intelligent Storage. Ship less data due to query intelligence in the storage.
  • Bigger Pipes. Infiniband with Remote Direct Memory Access. 5x Faster than Fibre Channel.
  • More Pipes. Scalable, redundant I/O Fabric.

Yes, it’s called Oracle Exadata Storage Server and it really was worth the wait. I know it is going to take a while for the message to settle in, but I would like to take my first blog post on the topic of Oracle Exadata Storage Server to reiterate the primary value propositions of the solution.

  • Exadata is fully optimized disk I/O. Full stop! For far too long, it has been too difficult to configure ample I/O bandwidth for Oracle, and far too difficult to configure storage so that the physical disk accesses are sequential.
  • Exadata is intelligent storage. For far too long, Oracle Database has had to ingest full blocks of data from disk for query processing, wasting precious host processor cycles to discard the uninteresting data (predicate filtering and column projection).

Oracle Exadata Storage Server is Brute Force. A Brawny solution.
A single rack of the HP Oracle Database Machine (based on Oracle Exadata Storage Server Software) is configured with 14 Oracle Exadata Storage Server “Cells” each with 12 3.5″ hard drives for a total of 168 disks. There are 300GB SAS and 1TB SATA options. The database tier of the single-rack HP Oracle Database Machine consists of 8 Proliant DL360 servers with 2 Xeon 54XX quad-core processors and 32 GB RAM running Oracle Real Application Clusters (RAC). The RAC nodes are interconnected with Infiniband using the very lightweight Reliable Datagram Sockets (RDS) protocol. RDS over Infiniband is also the I/O fabric between the RAC nodes and the Storage Cells. With the SAS storage option, the HP Oracle Database Machine offers roughly 1 terabyte of optimal user addressable space per Storage Cell, 14 TB in total.

Sequential I/O
Exadata I/O is a blend of random seeks followed by a series of large transfer requests, so scanning disk at rates of nearly 85 MB/s per disk drive (1000 MB/s per Storage Cell) is easily achieved. With 14 Exadata Storage Cells, the data-scanning rate is 14 GB/s. Yes, roughly 80 seconds to scan a terabyte, and that is with the base HP Oracle Database Machine configuration. Oracle Exadata Storage Server Software offers these scan rates on both tables and indexes, and partitioning is, of course, fully supported, as is compression.
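The arithmetic behind those claims is easy to sketch. A back-of-envelope check (the per-drive rate is the rough figure quoted above, not a guaranteed constant):

```python
# Figures quoted in the text: 12 drives per Storage Cell, 14 cells per
# rack, roughly 85 MB/s streamed from each SAS drive.
DRIVES_PER_CELL = 12
CELLS_PER_RACK = 14
MB_PER_SEC_PER_DRIVE = 85

cell_rate = DRIVES_PER_CELL * MB_PER_SEC_PER_DRIVE   # ~1020 MB/s per cell
rack_rate = cell_rate * CELLS_PER_RACK               # ~14,280 MB/s per rack
seconds_per_tb = 1_000_000 / rack_rate               # 1 TB = 1,000,000 MB

# About 70 seconds at the raw rate -- in the same ballpark as the
# "roughly 80 seconds" quoted above once real-world overhead is added.
print(cell_rate, rack_rate, round(seconds_per_tb))
```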

Comparison to “Old School”
Let me put Oracle Exadata Storage Server performance into perspective by drawing a comparison to Fibre Channel SAN technology. The building block of all native Fibre Channel SAN arrays is the Fibre Channel Arbitrated Loop (FCAL) to which the disk drives are connected. Some arrays support as few as 2 of these “back-end” loops, larger arrays support as many as 64. Most, if not all, current SAN arrays support 4 Gb FCAL back-end loops which are limited to no more than 400MB/s of read bandwidth. The drives connected to the loops have front-end Fibre Channel electronics and-forgetting FC-SATA drives for a moment-the drives themselves are fundamentally the same as SAS drives-given the same capacity and rotational speed. It turns out that SAS and Fibre drives, of the 300GB 15K RPM variety, perform pretty much the same for large sequential I/O. Given the bandwidth of the drives, the task of building a SAN-based system that isn’t loop-bottlenecked requires limiting the number of drives per loop to 5 (or 10 for mirroring overhead). So, to match a single rack configuration of the HP Oracle Database Machine with a SAN solution would require about 35 back-end drive loops! All of this math boils down to one thing: a very, very large high-end SAN array.
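The loop math in that paragraph can be made explicit; a minimal sketch using the figures above (400 MB/s of read bandwidth per 4 Gb back-end loop, roughly 80 MB/s per 15K RPM drive):

```python
import math

LOOP_MB_PER_SEC = 400      # usable read bandwidth of a 4 Gb FCAL back-end loop
DRIVE_MB_PER_SEC = 80      # streaming rate of a 300GB 15K RPM drive
TOTAL_DRIVES = 168         # single-rack HP Oracle Database Machine

drives_per_loop = LOOP_MB_PER_SEC // DRIVE_MB_PER_SEC      # 5 drives saturate a loop
loops_needed = math.ceil(TOTAL_DRIVES / drives_per_loop)   # 34 -- the "about 35" above
```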

Choices, Choices: Either the Largest SAN Array or the Smallest HP Oracle Database Machine
Only the largest of the high-end SAN arrays can match the base HP Oracle Database Machine I/O bandwidth. And this is provided the SAN array processors can actually pass through all the I/O generated from a full complement of back-end FCAL loops. Generally speaking, they just don’t have enough array processor bandwidth to do so.

Comparison to the “New Guys on the Block”
Well, they aren’t really that new. I’m talking about Netezza. Their smallest full rack has 112 Snippet Processing Units (SPUs), each with a single SATA disk drive plus onboard processor and FPGA components, for a total user addressable space of 12.5 TB. If the data streamed off the SATA drives at, say, 70 MB/s, the solution offers 7.8 GB/s, some 42% slower than a single-rack HP Oracle Database Machine.
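Spelling that comparison out with the figures used in the text (70 MB/s per SATA SPU against the conservative 80 MB/s SAS figure):

```python
netezza_rate = 112 * 70    # MB/s: 112 SPUs at 70 MB/s each
exadata_rate = 168 * 80    # MB/s: 168 SAS drives at a conservative 80 MB/s
shortfall = 1 - netezza_rate / exadata_rate

print(netezza_rate, exadata_rate, round(shortfall * 100))  # 7840 13440 42
```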

Big, Efficient Pipes
Oracle Exadata Storage Server delivers I/O results directly into the address space of the Oracle Database Parallel Query Option processes using the Reliable Datagram Sockets (RDS) protocol over Infiniband. As such, each of the Oracle Real Application Clusters nodes is able to ingest a little over a gigabyte of streaming data per second at a CPU cost of less than 5%, which is less than the typical cost of interfacing with Fibre Channel host-bus adaptors via traditional Unix/Linux I/O calls. With Oracle Exadata Storage Server, Oracle Database host processing power is wasted neither on filtering out uninteresting data nor on plucking columns from rows. There would, of course, be no need to project in a column-oriented database, but Oracle Database is still row-oriented.

Oracle Exadata Storage Server is Intelligent Storage. Brainy Software.
Oracle Exadata Storage Server truly is an optimized way to stream data to Oracle Database. However, none of the traditional Oracle Database features (e.g., partitioning, indexing, compression, Backup/Restore, Disaster Protection, etc.) are lost when deploying Exadata. Combining data elimination (via partitioning) with compression further exploits the core architectural strengths of Exadata. But what about this intelligence? Well, as we all know, queries seldom cite all the columns and few queries ever run without a WHERE predicate for filtration. With Exadata that intelligence is offloaded to storage. Exadata Storage Cells execute intelligent software that understands how to perform filtration as well as column projection. For instance, consider a query that cites 2 columns nestled in the middle of a 100-column row, where the WHERE predicate filters out 50% of the rows. With Exadata, only those 2 columns from the surviving rows are returned to the Oracle Parallel Query processes.
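To put a rough number on that example (assuming, for simplicity, uniform column widths, which real tables rarely have):

```python
COLUMNS_CITED = 2
TOTAL_COLUMNS = 100
ROWS_SURVIVING_PREDICATE = 0.5

# Fraction of the table's bytes that actually leave storage under Smart Scan.
fraction_shipped = (COLUMNS_CITED / TOTAL_COLUMNS) * ROWS_SURVIVING_PREDICATE
print(fraction_shipped)   # about 1% of the table
```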

By this time it should start to make sense why I have blogged in the past the way I do about SAN technology, such as this post about SAN disk/array bottlenecking.  Configuring a high-bandwidth SAN requires a lot of care.

Yes, this is a very short, technically-light blog entry about Oracle Exadata Storage Server, but this is day one. I didn’t touch on any of the other really exciting things Exadata does in the areas of I/O Resource Management, offloaded online backup and offloaded join filters, but I will.

Oracle OpenWorld Bound

Where’s Waldo?

As infrequently as I’ve posted over the last few months I’m sort of surprised I even have any readers remaining!

I will be at OpenWorld and I’d love to meet up with as many of you as I can. I’ll be working the Oracle Demo Ground in Moscone North on late Wednesday afternoon and Thursday morning. Until that point I’ll be attending sessions and catching up with a lot of folks I only get to see at the show these days. Feel free to send me an email at the address listed in my contact section of the blog.

If you are one of the people who like, or dislike, my positions on Fibre Channel SANs (i.e., the “Manly Man Series”), or want to talk more about why most Oracle shops aren’t realizing hard drive bandwidth, then send me a note and we’ll see if we can chat.

I’m really looking forward to the show this year. There seems to be significant buzz about the show, as this ComputerWorld.com article will attest.

Don’t forget to stop by the official OpenWorld Blog.

I Know Nothing About Data Warehouse Appliances and Now, So Won’t You – Part IV. Microsoft takes over DATAllegro.

It looks like my blog entries about DATAllegro (such as this piece about DATAllegro and magic 4GFC throughput) are going to start to sound a wee bit different:

Microsoft buys DATAllegro

Oracle Database 10g 10.2.0.4 Cannot Boot a Large SGA on AMD Servers Running Linux

In the comment thread of my recent blog entry entitled Of Gag-Orders, Excitement, and New Products, a fellow blogger, Jeff Hunter wrote:

I’d be happy if the major innovation was being able to run a 10.2.0.4 16G SGA on x86_64.

He offered a link to a thread on his blog where he has been chronicling his unsuccessful attempts to boot a 16GB SGA on the same iron that seemed to have no problem doing so with 10.2.0.3.

What’s New?

Oracle Database 10g release 10.2.0.4 has additional rudimentary support for NUMA in the Linux port, true, but Jeff has tried with NUMA enabled and disabled (via boot options), neither of which fixed his problem. In his latest installment on this thread I noticed he has renamed it to “The Great NUMA debate,” and the post ends with Jeff reporting that he still has trouble with his 16GB SGA, and now can’t boot even a 4GB SGA. Jeff wrote:

I still couldn’t start a 16GB SGA. Interestingly enough, I couldn’t start a 4G SGA either! I had to go back to booting without numa=off. The saga continues…

Unfortunately, I can’t jump in and debug what is wrong with his configuration, and I don’t know what the debate is. However, I can take a moment to post evidence that Oracle Database 10g 10.2.0.4 can in fact boot a 16GB SGA, in both AMD Opteron SUMA mode and NUMA mode. No, I don’t have any large-memory AMD systems around to test this myself. But I certainly used to. So, I decided to call in a favor to my old friend Mary Meredith (yes, old Sequent folks stick together) who has taken over the role I vacated at HP/PolyServe when I left to join Oracle. I asked Mary if she’d mind booting a 16GB SGA on one of those large-memory AMD systems I used to have available to me…and she did:

$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.4.0 - Production on Mon Jul 6 09:15:35 2008
Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.
Connected to an idle instance.
SQL> startup pfile=create1.ora
ORACLE instance started.
Total System Global Area 1.7700E+10 bytes
Fixed Size                  2115104 bytes
Variable Size             503319008 bytes
Database Buffers         1.7180E+10 bytes
Redo Buffers               14659584 bytes
Database mounted.
Database opened.

$ numactl --hardware
available: 1 nodes (0-0)
node 0 size: 32146 MB
node 0 free: 13821 MB
node distances:
node   0
  0:  10

So, here we see 10.2.0.4 on a SUMA-configured Proliant DL585 with a 16GB buffer pool. I asked Mary if she’d be willing to boot in NUMA mode (Linux boot option) and give it a try, and she did:

$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.4.0 - Production on Mon Jul 7 10:03:35 2008
Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.
Connected to an idle instance.
SQL> startup pfile=create1.ora
ORACLE instance started.
Total System Global Area 1.7700E+10 bytes
Fixed Size                  2115104 bytes
Variable Size             503319008 bytes
Database Buffers         1.7180E+10 bytes
Redo Buffers               14659584 bytes
Database mounted.
Database opened.
SQL> quit

But she reported that she didn’t get any hugepages:

$ cat /proc/meminfo|grep Huge
HugePages_Total:  8182
HugePages_Free:   8182
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

I pointed out that 8,182 2MB hugepages is not big enough. I recommended she up that to 8,500 and then start the database up under strace so we could capture the shmget() call to ensure it was flagging in SHM_HUGETLB, and it was:

$ cat /proc/meminfo|grep Huge
HugePages_Total:  8500
HugePages_Free:   7132
HugePages_Rsvd:   7073
Hugepagesize:     2048 kB

And from the strace:

6510  shmget(0x1420f290, 17702060032, IPC_CREAT|IPC_EXCL|SHM_HUGETLB|0600) = 393219

And…

$ ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0x1420f290 393219     oracle    600        17702060032 12
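The sizing arithmetic is worth showing: dividing the 17,702,060,032-byte shmget() request above by the 2MB hugepage size explains why the original pool fell short:

```python
SGA_BYTES = 17_702_060_032        # size requested in the shmget() call above
HUGEPAGE_BYTES = 2048 * 1024      # Hugepagesize: 2048 kB

pages_needed = -(-SGA_BYTES // HUGEPAGE_BYTES)   # ceiling division
print(pages_needed)   # 8441 -- more than the original ~8,200 pages, under the 8,500 bump
```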

Also, in the NUMA configuration we see a good, even distribution of pages allocated from each of the “nodes”, with the exception of node zero which until Linux gets fully NUMA-aware will always be over-consumed:

$ numactl --hardware
available: 4 nodes (0-3)
node 0 size: 7906 MB
node 0 free: 2025 MB
node 1 size: 8080 MB
node 1 free: 3920 MB
node 2 size: 8080 MB
node 2 free: 3969 MB
node 3 size: 8080 MB
node 3 free: 3926 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

We also see that the shmget() call did flag in SHM_HUGETLB and correspondingly we see the shmkey in the ipcs output. We also see hugepages being used, although mostly just reserved.

So, I haven’t been able to see Jeff’s strace output or other such diagnostic information so I can’t help there. However, this blog post is meant to be a confidence booster to any wayward googler who might happen to be having difficulty booting a VLM SGA on AMD Opteron running Linux with Oracle Database 10g release 10.2.0.4.

Extra Credit

So, if Mary had booted in NUMA mode without hugepages, does anyone think it would have resulted in such a nice even consumption of pages from the nodes, or would it have looked like Cyclops? We all recall Cyclops, don’t we? In case you don’t here is a link:
Oracle on Opteron with Linux–The NUMA Angle Part VI. Introducing Cyclops.

Oracle Database Doesn’t Use Hugepages Correctly. What’s Better, Reserved or Used?

I’ve received questions about HugePages_Rsvd a few times in the last few months. After googling for HugePages_Rsvd +Oracle and not seeing a whole lot, I thought I’d put out this quick blog entry.

Here I have a system with 600 hugepages configured:

# cat /proc/meminfo | grep HugePages
HugePages_Total: 600
HugePages_Free: 600
HugePages_Rsvd: 0

Next, I boot up this 1.007GB SGA:

SQL*Plus: Release 11.1.0.6.0 - Production on Tue Jul 8 11:25:14 2008

Copyright (c) 1982, 2008, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ORACLE instance started.

Total System Global Area 1081520128 bytes
Fixed Size                  2166960 bytes
Variable Size             339742544 bytes
Database Buffers          734003200 bytes
Redo Buffers                5607424 bytes
Database mounted.
Database opened.
SQL>

Booting this SGA only used up 324 pages:

#  cat /proc/meminfo | grep HugePages
HugePages_Total:   600
HugePages_Free:    276
HugePages_Rsvd:    195

If my buffers are 700 MB and my variable SGA component is 324 MB, why weren’t 512 hugepages used? Let’s see what happens when I start using some buffers and library cache. I’ll run catalog.sql and catproc.sql and then check hugepages again:

#  cat /proc/meminfo | grep HugePages
HugePages_Total:   600
HugePages_Free:    237
HugePages_Rsvd:    156

That used up another 39 hugepages, or 78 MB. At this point my SGA usage still leaves about 305 MB of unbacked virtual memory. If I were to run some OLTP, the rest would get allocated. The idea here is that it really makes no sense to do the allocation overhead until the pages are actually touched. It makes no sense to go to all the trouble in VM land if the pages might never be used. Think about an errant program that allocates a sizable amount of hugepages just to rapidly die. While that’s not Oracle, the Linux guys have to keep a pretty general-purpose mindset. This really goes back to the olden days of Unix when folks argued the virtues of pre-allocating swap to ensure there would never be a condition where a swap-out couldn’t be satisfied. The problem with that approach was that before calls like vfork() became popular there was a ton of overhead on large systems just to retire VM resources of very short lived processes, such as those which fork() only to immediately exec().
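Since Linux counts a reserved hugepage as free until it is actually faulted in, the meminfo numbers above decode like this (a sketch using the exact values shown):

```python
TOTAL = 600

# After instance startup: HugePages_Free 276, HugePages_Rsvd 195.
faulted_at_boot = TOTAL - 276              # 324 pages actually backed
committed_at_boot = faulted_at_boot + 195  # 519 -- a few pages over the ~516-page SGA

# After catalog.sql/catproc.sql: HugePages_Free 237.
faulted_after_catalog = TOTAL - 237                      # 363
newly_faulted = faulted_after_catalog - faulted_at_boot  # 39 pages = 78 MB touched
```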

OK, so that was a light-reading blog entry, but some googler, someday, might find it interesting.

Yes, that was a come-on title…so surprising, isn’t it? 🙂

I Ain’t Not Too Purdie Smart, But I Know One Thing For Certain: MAA Literature is Required Reading!

You Need to See What These Folks Have to Say

It is hereby official! I absolutely must put out a plug for the MAA team and the fruits of their labor now that I have personally worked with them on a project. I’m sure it’s no credit to them, per se, but honestly, this team is really, really sharp!

Go get some of those papers!

I Know Nothing About Data Warehouse Appliances and Now, So Won’t You – Part III. Tuning Data Warehouse Appliances.

I spent a little time last night perusing Stuart Frost’s blog (CEO, DATAllegro) and learned something new. Microsoft, it appears, has ported Windows and SQL Server to platforms beyond x86, x86_64 and IA64. I quote:

Database vendors such as Oracle and Microsoft have to build their software to run on any hardware. Hence there are a plethora of tuning parameters and options for the DBA and sys admins to setup.

No, MSFT products do not run on enough platforms to somehow make them difficult to tune.

Oracle’s port list has gotten “quite small” over the years due to the death of all the niche players (Sequent, Pyramid, SGI, Data General, etc). The 10gR2 list is down to 20 ports according to OTN. And, yes, deploying the same database software on a 4 CPU platform and a 128 CPU platform in the same day might make most Oracle professionals give a little extra consideration to certain tuning parameters. I don’t think that is a weakness on the part of Oracle though.

From what I can see of DATAllegro, the primary ingredient in the DATAllegro secret sauce is strong focus on getting full bandwidth from all the drives. That is a difficult value proposition to argue with, but the topic is certainly nothing new as my post entitled Hard Drives Are Arcane Technology. So Why Can’t I Realize Their Full Bandwidth Potential? will attest.

Tuning Your Toaster or Refrigerator

So this whole blog entry was to call out Stuart Frost’s comment that insinuated Oracle is difficult to deal with because it is ported to so many platforms. I hate to break the news, but platform-specific Oracle tunables (i.e., init.ora) have been on a steep downhill trend since Oracle8i. They are considered very undesirable, but they do, for obvious reasons, exist in some ports. Having said that, how does having a few extra port-specific tunables in, say, the HP-UX port supposedly make life more difficult for an Oracle DBA working in a Linux shop? It doesn’t. It is a red herring.

If you think the fact that DATAllegro is marketed as an appliance somehow limits its tunables to the degree of your toaster or refrigerator, just remember that there is Ingres in there, and you can feel free to read the 37 pages of the Ingres DBA Guide dedicated to storage structures alone.

I’m not too smart, but I know for certain that my refrigerator didn’t come with 37 pages of documentation explaining the ice maker attachment.

I Know Nothing About Data Warehouse Appliances and Now, So Won’t You – Part II. DATAllegro Supercharges Fibre Channel Performance.

BLOG CORRECTION: The next-to-last paragraph has been edited to offer more clarity on which components impose limits on I/O transfer sizes.

I’m going to tell you something nobody else knows. You’ve heard it here first. Ready? Here’s the deal: no more than 800 MB/s can pass through two 4 Gb Fibre Channel HBAs into any host system memory. It’s that simple. If you want more than 800 MB/s available for your CPUs, you have to add more 4 Gb HBAs, go with 8 Gb Fibre, or drop FCP altogether and go with something that can deliver at that level. But this isn’t a plug for the Manly Man Series on Fibre Channel Technology; I’m blogging about Data Warehouse Appliance technology, specifically DATAllegro.

Exit Conventional Wisdom, and Electronics!

Here is a graphic of the V3 DATAllegro building block. It’s two Dell 2950s (a.k.a., Compute Nodes) each plumbed with two 4 Gb Fibre Channel HBAs to a small EMC CX3 array. According to this piece on DATAllegro’s website, they are the only people on the planet to push more than is electronically possible through two 4 Gb HBAs, I quote:

Data for each compute node is partitioned into six files on dedicated disks with a shared storage node. Multi-core allows each of these six partitions to be read in parallel. Data is streamed off these partitions using DATAllegro Direct Data StreamingTM (DDS) technology that maximizes sequential reads from each disk in the array. DDS ensures the appliance architecture is not I/O bound and therefore pegged by the rate of improvement of storage technology. As a result, read rates of over 1.2 GBps per compute node are possible.

That’s right. I wasn’t going to point out that each compute node is fed by six disks, because if I did I’d also have to tell you they are 7200 RPM SATA drives, mirrored. Supposedly we are to believe that the pixie dust known as Direct Data StreamingTM can, uh, pull data at what rate per spindle? Yes, that’s right, they say 200 MB/s per drive! Folks, I’ve got 7200 LFF SATA drives all over the place and you can’t get more than 80 MB/s per drive from these things (and that is actually fairly tough to do). Even EMC’s own specification sheet for the CX3 spells out the limit as 31-64 MB/s. I’ll attest that if your code stays out on the outer, say, 10% of the drive you can stream as much as 75-80 MB/s from these things. So with the DATAllegro system, and using my best numbers (not EMC’s published numbers), you’d only expect to get some 480 MB/s from six 7200 RPM SATA drives (6×80). Wow, that Direct Data StreamingTM technology must be really cool, albeit totally cloak-and-dagger. Let’s not stop there.
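The spindle math being disputed, using my own generous 80 MB/s outer-track figure (EMC’s spec sheet, remember, says 31-64 MB/s):

```python
DRIVES = 6
GENEROUS_MB_PER_SEC = 80       # best-case outer-track streaming for 7200 RPM SATA

expected = DRIVES * GENEROUS_MB_PER_SEC    # 480 MB/s, being charitable
claimed = 1200                             # "over 1.2 GBps per compute node"
implied_per_drive = claimed / DRIVES       # 200 MB/s per spindle -- not credible

print(expected, implied_per_drive)
```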

What about this 1.2 GB/s per compute node claim? How do you pump that through 2 x 4 Gb FC HBAs? You don’t. Not even DATAllegro with all those Cool SoundingTM technologies. What’s really being said in that DATAllegro overview piece is that their effective ingestion rate is some 1.2 GB/s, I quote:

Compression expands throughput: Within each node, two of the multi-core processors are reserved for software compression. This increases I/O throughput from 800MBps from the shared storage node to over 1.2 GBps for each compute node.

They could just come out and say it, but they expect you to believe in magic. I’ll quote Stuart Frost (CEO, DATAllegro) on more of this magic, secret sauce:

Another very important aspect of performance is ensuring sequential reads under a complex workload. Traditional databases do not do a good job in this area – even though some of the management tools might tell you that they are! What we typically see is that the combination of RAID arrays and intervening storage infrastructure conspires to break even large reads by the database into very small reads against each disk.

Traditional databases are only victims of what storage arrays do with the I/O requests by way of slicing and dicing. Further, the OS and FC HBA impose limits for the size of large I/O requests. It is not a characteristic of a traditional database system. Even a Totally Rad Non-Traditional RDBMSTM like the one DATAllegro embeds in their compute nodes (spoiler: it’s Ingres, nothing new) will fall prey to what the array controller does with large I/O requests. But more to the point, FC HBAs and the Linux (CentOS for DATAllegro) block I/O layer impose limits on the size of transfers and that is generally 1MB.

If I’m wrong, I expect DATAllegro to educate us, with proof, not more implied Awesomely Fabulicious CoolFlips Technology TM. In the end, however, no matter whether they managed to code custom FC HBA drivers and somehow obtained custom firmware for the CX3 to achieve larger transfer sizes than anyone else or not, I’ll bet dollars to donuts they can’t push more than 800 MB/s through dual 4 Gb FCP HBAs, and certainly not from 6 7200 RPM SATA drives.

I Know Nothing About Data Warehouse Appliances, and Now, So Won’t You – Part I

I’ve been watching all these come-lately DW/BI technologies for a while now-especially the ever-so-highly-revered “appliances.” I’m also interested in columnar orientation as my past posts on columnar technology (e.g., columnar technology I, columnar technology II) will attest.

Rows and Columns, or Columns and Rows?

I don’t know, because in that famed Unfrozen Caveman Lawyer style, these things confuse me. However, Stuart Frost, CEO of DATAllegro, puts it this way in his fledgling blog:

At the end of the day, column orientation is just one approach to limiting the amount of data read for a given query. In effect, it’s an extreme form of vertical partitioning of the data. In modern row-oriented systems such as DATAllegro, we use sophisticated horizontal partitioning to limit the number of rows read for each query.

Clue’isms are Truisms

Huh? “Sophisticated horizontal partitioning?” Now that is a novel approach. And if all I want to scan is a column or two with Oracle, I’ll create an index. Is it really that much more complicated than that? An index is a columnar representation, after all. Heck, I could even partition that “columnar representation” with a sophisticated horizontal partitioning technology (that has been in Oracle since the early 1990s) to further reduce the data ingestion cost.

Indexes == Anathema

Oops, I should wash my mouth out with soap. After all, the “appliances” shall save you from the torment of creating a few indexes, right? Well, maybe not. The term of the day is “Index-Light Appliance.”

So I have to ask, what if I were to implement an Oracle-based data warehouse that used, say, 5 indexes. Would that be an Index-Light approach?

Oracle is taking steps to make the configuration of hardware for a DW/BI deployment a bit simpler. If you haven’t yet seen it, the Optimized Warehouse Initiative is worth investigating.


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.


Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.