
Oracle Exadata Storage Server. No Magic in an Imperfect World. Excellent Tools and Really Fast I/O Though.

I’ve met people before who’d rather drink muriatic acid than admit a mistake. I don’t cotton to such folks none too purdy good. This is a post about perfection and imperfection.

Cost Based Optimizer

There is no such thing as a perfect cost-based optimizer (CBO), yet the expectations placed on cost-based optimizers run rampant. If the Oracle Database 11g CBO were perfect there would be no need for an excellent performance tuning tool like SQL Tuning Advisor and Reactive Tuning. In those cases where the CBO does not generate a perfect plan, Oracle Database equips administrators with a tuning toolset that generally gets it right.

What’s This Have to do with Oracle Exadata Storage Server?

In my blog entry entitled Oracle Exadata Storage Server: A Black Box with No Statistics, I discussed the Affinity Card Program Test Database (ACPTD). One of the reasons I use this schema and query set is that, unlike a benchmark kit, it is not perfect. I expect Exadata to function as advertised in an imperfect world.

Imagine an organization that inherited some imperfect data with an imperfect schema. Nah, that would never happen. What about queries that aren’t perfect? Ever seen any of those? What about imperfect query plans?

One of the business questions, in English, that I test with the Affinity Card Program Test Database reads as follows:

List our customers with Club Cards who purchased more than $1,000.00 worth of goods/services in the west retail region. Omit non-affinity card purchases and all restaurant, travel and gasoline purchases.

The following text box has the SQL for this business question. Now, before you get all excited about the usage of the mcc.misc column (line 5 in the text box), please remember what I just explained in the previous paragraph. Sometimes I like imperfection when I’m analyzing performance. But what does this have to do with Exadata Storage Server? Well, I could certainly use a codes table and anti-join the purchases that are not restaurant, travel or gasoline. No question there. However, for this particular query I wanted imperfection, not a lesson in schema design. So, the purpose behind the not like ‘restaurant%travel%gasoline%’ predicate is to test a more heavily weighted filtration effort in the Exadata Storage Server. It’s synthetic, I know, but at least it is imperfect-which I like.

with act as
(
select   act.card_no, act.purchase_amt from zipcodes z, all_card_trans act, mcc m
where    (act.card_no like '4777%' or act.card_no like '3333%') and
act.mcc = m.mcc and m.misc not like 'restaurant%travel%gasoline%' and
act.merchant_zip = z.zip and z.state in ('CA', 'OR', 'WA')
)
select
cf.custid,
sum (act.purchase_amt) sales from act, cust_fact cf
where act.card_no = cf.aff_cc_num and cf.club_card_num not like '0000%'
group by cf.custid
having   sum (act.purchase_amt) > 1000;

Notice in the following text box how the plan shows Exadata Storage Server is filtering out all_card_trans, cust_fact, zipcodes and mcc rows. The filtering effort on mcc is not colossal, but certainly heftier than the filtration on z.state, since there are synthetic data-cleanliness values in mcc.misc such as ‘restaurant%travel%gasoline’ and so forth.


Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter(SUM("ACT"."PURCHASE_AMT")>1000)
7 - access("ACT"."CARD_NO"="CF"."AFF_CC_NUM")
10 - access("ACT"."MERCHANT_ZIP"="Z"."ZIP")
13 - access("ACT"."MCC"="M"."MCC")
18 - storage("ACT"."CARD_NO" LIKE '4777%' OR "ACT"."CARD_NO" LIKE '3333%')
filter("ACT"."CARD_NO" LIKE '4777%' OR "ACT"."CARD_NO" LIKE '3333%')
23 - storage("M"."MISC" NOT LIKE 'restaurant%travel%gasoline%' AND SYS_OP_BLOOM_FILTER(:BF0000,"M"."MCC"))
filter("M"."MISC" NOT LIKE 'restaurant%travel%gasoline%' AND SYS_OP_BLOOM_FILTER(:BF0000,"M"."MCC"))
25 - storage("Z"."STATE"='CA' OR "Z"."STATE"='OR' OR "Z"."STATE"='WA')
filter("Z"."STATE"='CA' OR "Z"."STATE"='OR' OR "Z"."STATE"='WA')
27 - storage("CF"."CLUB_CARD_NUM" NOT LIKE '0000%')
filter("CF"."CLUB_CARD_NUM" NOT LIKE '0000%')

What Does This Have to do with Cost Based Optimizer and SQL Tuning Advisor?

The fact that Exadata does not leave any of the core Oracle Database 11g value propositions off the table is a point that has been made in the press and blogging community over and over again. I’m sure I’ll get spanked for saying this, but occasional bad plans are a fact of life with cost-based optimizers-and that includes Oracle. Those of us willing to accept and live with this reality are very appreciative of the goodness offered by the SQL Tuning Advisor. There are several canned queries in the ACPTD. The other day I loaded the scale that leaves me with an ALL_CARD_TRANS table of roughly 3TB. Of the several queries, I was disappointed with the performance of only one-the query listed above with the imperfect method for weeding out merchant codes. Since this is Exadata you can rest assured that the I/O was not a problem, but the query took 721 seconds to complete! I knew that was in error because this query gets partition elimination on all_card_trans.card_no, and 721 seconds of Oracle Exadata Storage Server Smart Scan throughput on this 6-cell configuration would rack up over 4TB of total I/O (6GB/s*721). Oracle Database does not mistakenly read an entire table when partition pruning is possible, so what was the issue?

The following text box shows the plan. Notice the estimate in step 18. The plan, it seems, didn’t have the best information about the relationship between ALL_CARD_TRANS and CUST_FACT.


------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name           | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |                |    15 |  1395 |   207K  (2)| 00:48:32 |        |      |            |
|   1 |  PX COORDINATOR                            |                |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)                      | :TQ10005       |    15 |  1395 |   207K  (2)| 00:48:32 |  Q1,05 | P->S | QC (RAND)  |
|*  3 |    FILTER                                  |                |       |       |            |          |  Q1,05 | PCWC |            |
|   4 |     HASH GROUP BY                          |                |    15 |  1395 |   207K  (2)| 00:48:32 |  Q1,05 | PCWP |            |
|   5 |      PX RECEIVE                            |                |    15 |  1395 |   207K  (2)| 00:48:32 |  Q1,05 | PCWP |            |
|   6 |       PX SEND HASH                         | :TQ10004       |    15 |  1395 |   207K  (2)| 00:48:32 |  Q1,04 | P->P | HASH       |
|*  7 |        HASH JOIN                           |                |    15 |  1395 |   207K  (2)| 00:48:32 |  Q1,04 | PCWP |            |
|   8 |         PX RECEIVE                         |                |     2 |   104 |   206K  (2)| 00:48:11 |  Q1,04 | PCWP |            |
|   9 |          PX SEND BROADCAST                 | :TQ10003       |     2 |   104 |   206K  (2)| 00:48:11 |  Q1,03 | P->P | BROADCAST  |
|* 10 |           HASH JOIN                        |                |     2 |   104 |   206K  (2)| 00:48:11 |  Q1,03 | PCWP |            |
|  11 |            PX RECEIVE                      |                |     2 |    88 |   206K  (2)| 00:48:11 |  Q1,03 | PCWP |            |
|  12 |             PX SEND BROADCAST              | :TQ10002       |     2 |    88 |   206K  (2)| 00:48:11 |  Q1,02 | P->P | BROADCAST  |
|* 13 |              HASH JOIN BUFFERED            |                |     2 |    88 |   206K  (2)| 00:48:11 |  Q1,02 | PCWP |            |
|  14 |               JOIN FILTER CREATE           | :BF0000        |     2 |    58 |   206K  (2)| 00:48:09 |  Q1,02 | PCWP |            |
|  15 |                PX RECEIVE                  |                |     2 |    58 |   206K  (2)| 00:48:09 |  Q1,02 | PCWP |            |
|  16 |                 PX SEND HASH               | :TQ10000       |     2 |    58 |   206K  (2)| 00:48:09 |  Q1,00 | P->P | HASH       |
|  17 |                  PX BLOCK ITERATOR         |                |     2 |    58 |   206K  (2)| 00:48:09 |  Q1,00 | PCWC |            |
|* 18 |                   TABLE ACCESS STORAGE FULL| ALL_CARD_TRANS |     2 |    58 |   206K  (2)| 00:48:09 |  Q1,00 | PCWP |            |
|  19 |               PX RECEIVE                   |                |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,02 | PCWP |            |
|  20 |                PX SEND HASH                | :TQ10001       |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,01 | P->P | HASH       |
|  21 |                 JOIN FILTER USE            | :BF0000        |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,01 | PCWP |            |
|  22 |                  PX BLOCK ITERATOR         |                |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,01 | PCWC |            |
|* 23 |                   TABLE ACCESS STORAGE FULL| MCC            |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,01 | PCWP |            |
|  24 |            PX BLOCK ITERATOR               |                |  3953 | 31624 |     3   (0)| 00:00:01 |  Q1,03 | PCWC |            |
|* 25 |             TABLE ACCESS STORAGE FULL      | ZIPCODES       |  3953 | 31624 |     3   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  26 |         PX BLOCK ITERATOR                  |                |  6278K|   245M|  1527   (1)| 00:00:22 |  Q1,04 | PCWC |            |
|* 27 |          TABLE ACCESS STORAGE FULL         | CUST_FACT      |  6278K|   245M|  1527   (1)| 00:00:22 |  Q1,04 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------------------

So, I executed the query under the SQL Tuning Advisor API as follows.


var tname varchar2(500);

exec :tname := dbms_sqltune.create_tuning_task(sql_text=>' with act as ( select act.card_no, act.purchase_amt from zipcodes z, all_card_trans act, mcc m where (act.card_no like ''4777%'' or act.card_no like ''3333%'') and act.mcc = m.mcc and m.misc not like ''restaurant%travel%gasoline%'' and act.merchant_zip = z.zip and z.state in (''CA'', ''OR'', ''WA'')) select cf.custid, sum (act.purchase_amt) sales from act, cust_fact cf where act.card_no = cf.aff_cc_num and cf.club_card_num not like ''0000%'' group by cf.custid having sum (act.purchase_amt) > 1000');

exec dbms_sqltune.execute_tuning_task(:tname);

select dbms_sqltune.report_tuning_task(:tname) AS recommendations from dual;

SQL Tuning Advisor did recommend a different plan and, as you can see in the following text box, the primary difference is in the cardinality estimate on ALL_CARD_TRANS. With this learned information the join order changed and the result was a 6.5X speedup as the query completed in 110 seconds.


---------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name           | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                |     1 |    93 |   208K  (2)| 00:48:33 |        |      |            |
|   1 |  PX COORDINATOR                         |                |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)                   | :TQ10005       |     1 |    93 |   208K  (2)| 00:48:33 |  Q1,05 | P->S | QC (RAND)  |
|*  3 |    FILTER                               |                |       |       |            |          |  Q1,05 | PCWC |            |
|   4 |     HASH GROUP BY                       |                |     1 |    93 |   208K  (2)| 00:48:33 |  Q1,05 | PCWP |            |
|   5 |      PX RECEIVE                         |                |    86M|  7640M|   207K  (2)| 00:48:32 |  Q1,05 | PCWP |            |
|   6 |       PX SEND HASH                      | :TQ10004       |    86M|  7640M|   207K  (2)| 00:48:32 |  Q1,04 | P->P | HASH       |
|*  7 |        HASH JOIN BUFFERED               |                |    86M|  7640M|   207K  (2)| 00:48:32 |  Q1,04 | PCWP |            |
|   8 |         PX RECEIVE                      |                |  6278K|   245M|  1527   (1)| 00:00:22 |  Q1,04 | PCWP |            |
|   9 |          PX SEND HASH                   | :TQ10002       |  6278K|   245M|  1527   (1)| 00:00:22 |  Q1,02 | P->P | HASH       |
|  10 |           PX BLOCK ITERATOR             |                |  6278K|   245M|  1527   (1)| 00:00:22 |  Q1,02 | PCWC |            |
|* 11 |            TABLE ACCESS STORAGE FULL    | CUST_FACT      |  6278K|   245M|  1527   (1)| 00:00:22 |  Q1,02 | PCWP |            |
|  12 |         PX RECEIVE                      |                |  6802K|   337M|   206K  (2)| 00:48:11 |  Q1,04 | PCWP |            |
|  13 |          PX SEND HASH                   | :TQ10003       |  6802K|   337M|   206K  (2)| 00:48:11 |  Q1,03 | P->P | HASH       |
|* 14 |           HASH JOIN                     |                |  6802K|   337M|   206K  (2)| 00:48:11 |  Q1,03 | PCWP |            |
|  15 |            PX RECEIVE                   |                |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,03 | PCWP |            |
|  16 |             PX SEND BROADCAST           | :TQ10000       |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,00 | P->P | BROADCAST  |
|  17 |              PX BLOCK ITERATOR          |                |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,00 | PCWC |            |
|* 18 |               TABLE ACCESS STORAGE FULL | MCC            |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,00 | PCWP |            |
|* 19 |            HASH JOIN                    |                |  7220K|   254M|   206K  (2)| 00:48:10 |  Q1,03 | PCWP |            |
|  20 |             PX RECEIVE                  |                |  3953 | 31624 |     3   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  21 |              PX SEND BROADCAST          | :TQ10001       |  3953 | 31624 |     3   (0)| 00:00:01 |  Q1,01 | P->P | BROADCAST  |
|  22 |               PX BLOCK ITERATOR         |                |  3953 | 31624 |     3   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|* 23 |                TABLE ACCESS STORAGE FULL| ZIPCODES       |  3953 | 31624 |     3   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|  24 |             PX BLOCK ITERATOR           |                |    78M|  2179M|   206K  (2)| 00:48:09 |  Q1,03 | PCWC |            |
|* 25 |              TABLE ACCESS STORAGE FULL  | ALL_CARD_TRANS |    78M|  2179M|   206K  (2)| 00:48:09 |  Q1,03 | PCWP |            |
---------------------------------------------------------------------------------------------------------------------------------------
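
For completeness, if the advisor packages its recommendation as a SQL profile, accepting it is one more DBMS_SQLTUNE call. The following is only a sketch assuming the task name is still in the :tname bind variable; the profile name and REPLACE setting are assumptions of mine, not something shown above:

-- Hypothetical follow-up (not from the run above): accept the profile
-- recommended by the tuning task referenced by the :tname bind variable.
-- The profile name and replace setting are illustrative assumptions.
exec dbms_sqltune.accept_sql_profile(task_name => :tname, name => 'acptd_imperfect_q1', replace => TRUE);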

Summary

After a brief bout with an imperfect query plan, the Oracle Database 11g SQL Tuning Advisor fixed what ailed me. With the plan offered by SQL Tuning Advisor I was able to push my imperfect query through my imperfect schema at a perfect I/O rate of 6 GB/s on my small 6-cell Oracle Exadata Storage Server test configuration.

In spite of how many times EMC’s Chuck Hollis may claim that there is “nothing new” or “no magic” when referring to Oracle Exadata Storage Server, I think it is painfully obvious that there is indeed “something new” here. Is it magic? No, and we don’t claim that it’s magic. What we do claim is that we have moved filtration and projection down to storage and, more importantly, we don’t bottleneck disks (see Hard Drives Are Arcane Technology. So Why Can’t I Realize Their Full Bandwidth Potential?). So while we don’t claim Exadata is magic, there is one thing this blog post shows-all the pre-existing magic, like SQL Tuning Advisor, is still in the arsenal. When you deploy Exadata you leave nothing at the door.

Finally, a no-magic approach to something really new.

Oracle Exadata Storage Server: A Black Box with No Statistics.

A question came in about whether it is possible to measure how much data is filtered out when running a query serviced by a Smart Scan in the Exadata Storage Server grid. The following is the long answer.

An Example of Offload Processing Effectiveness Accounting

I’d like to answer this question by taking real information from a test configuration consisting of 4 Real Application Clusters nodes attached to 6 Exadata Storage Server cells with the SAS disk option (12 x 300GB 3.5″ 15K RPM).

Test Workload – The Affinity Card Program Test Database

The Affinity Card Program Test Database (ACPTD) is a test schema and set of queries that mimics the type of queries performed by a marketing group responsible for promoting the use of a retail club card and affinity credit card for a fictitious company. In this example deployment of the ACPTD, the group responsible for promoting club card and affinity credit card activity for Acme Inc has built a data mart with data from several sources, including the main corporate ERP and CRM systems; partner merchant ERP systems; and, outside syndicated data.

For this blog post I’ll focus on the main syndicated card transaction table called all_card_trans. However, there are other tables in the schema and for reference sake the following text box shows the table sizes in my current deployment of the test kit. As the query shows, the all_card_trans table is 613 GB. Yes, I know that’s small, but this is a test system and I don’t like watching 6 Exadata cells hammering 1 GB/s each for tens of minutes when a couple of minutes will do. I hope you’ll understand.

SQL> set echo on
SQL> @tab_sizes
SQL> col segment_name for a32
SQL> select segment_name, sum(bytes)/1024/1024 mb from user_segments
  2  where segment_name not like 'BIN%' and segment_name not like 'SYS%'
  3  group by segment_name;

SEGMENT_NAME                             MB
-------------------------------- ----------
PARTNER_MERCHANTS                         8
PARTNER_MERCHANT_SALES                67780
ZIPCODES                                  8
OUR_SALES                             21956
ALL_CARD_TRANS                       629032
MCC                                       8
OUR_STORES                                8
CUST_SERVICE                             76
TM                                   .03125
CUST_FACT                              4708

10 rows selected.

The table definition for the all_card_trans table is shown in the following text box:

SQL> desc all_card_trans
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CARD_NO                                   NOT NULL VARCHAR2(20)
 CARD_TYPE                                          CHAR(20)
 MCC                                       NOT NULL NUMBER(6)
 PURCHASE_AMT                              NOT NULL NUMBER(6)
 PURCHASE_DT                               NOT NULL DATE
 MERCHANT_CODE                             NOT NULL NUMBER(7)
 MERCHANT_CITY                             NOT NULL VARCHAR2(40)
 MERCHANT_STATE                            NOT NULL CHAR(3)
 MERCHANT_ZIP                              NOT NULL NUMBER(6)

The following queries show the total row count in the all_card_trans table as well as the number of rows representing transactions with credit cards with leading 4 digits of 3333 and 4777, which represent the card block granted to our fictitious company, Acme Inc, for their affinity card program. This is a needle-in-a-haystack query since only .3% of the data matches the where predicate. Now, before you look at the text box please note that this is a single instance of Real Application Clusters driving this query. And, yes, the RAC node is a small 8-core DL360. Both scans are driving storage at 6 GB/s (613GB/102s). Folks, remember that to do what I just did in the following text box with Fibre Channel I’d have to have a system with 15 FC HBAs attached to several huge SAN arrays. Oh, and I should point out that it would also require about 85 Netezza Snippet Processing Units to match this throughput.

SQL> select count(*) from all_card_trans where  card_no like '4777%' or card_no like '3333%';

  COUNT(*)
----------
  22465830

Elapsed: 00:01:42.71
SQL> select count(*) from all_card_trans;

  COUNT(*)
----------
7897611641

Elapsed: 00:01:42.47

I Can’t Believe I Ate the Whole Thing

The following query uses a method for measuring cell activity by forcing the database to ingest all rows and all columns. Don’t try to make sense of the query because it isn’t supposed to make sense. It is only supposed to drive Exadata to return all rows and all columns, which it does.

$ more demo_offload_bytes.sql

drop view ingest;
create view ingest as
select avg(length(CARD_NO)) c1,
avg(length(CARD_TYPE)) c2,
max(MCC + PURCHASE_AMT) c3,
max(length(PURCHASE_DT)) c4,
max(MERCHANT_CODE) c5,
avg(length(MERCHANT_CITY)) c6,
avg(length(MERCHANT_STATE)) c7,
max(MERCHANT_ZIP) c8 from all_card_trans;

col MB format 99999999.9
select NAME,VALUE/(1024*1024*1024) GB from gv$sysstat where STATISTIC# in (196, 44, 43 );

select c1+c2+c3+c4+c5+c6+c7 from ingest;

select NAME,VALUE/(1024*1024*1024) GB from gv$sysstat where STATISTIC# in (196, 44, 43);

The following text box shows that when I ran the query Exadata (in aggregate) scanned 613 GB of disk and returned all but 6% to the RDBMS instance (delta cell physical IO interconnect bytes). Also recorded by v$sysstat (bytes eligible) is the fact that there was nothing peculiar about the data being scanned-peculiar in any such fashion that would have interfered with offload processing (e.g., chained rows, etc). Since I asked for all the data, that is what I got. It is nice to know, however, that the entirety of the data was a candidate for offload processing.

Notice I didn’t time the query. I’ll offer more on that later in this blog post.

SQL> @demo_offload_bytes

View dropped.

View created.

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           1227.80814
cell physical IO interconnect bytes                              1151.52453
cell physical IO bytes eligible for predicate offload            1227.31451

C1+C2+C3+C4+C5+C6+C7
--------------------
          9010106.59

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                            1841.5774
cell physical IO interconnect bytes                              1727.15227
cell physical IO bytes eligible for predicate offload            1840.97177
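
As an aside, the same three statistics can be queried by name rather than by STATISTIC#, which reads better and is less tied to a particular release. A quick sketch using the statistic names shown in the output above:

select name, value/(1024*1024*1024) gb
from   gv$sysstat
where  name in ('physical IO disk bytes',
                'cell physical IO interconnect bytes',
                'cell physical IO bytes eligible for predicate offload');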

So let’s take a look at some offload processing. The following change to the “ingest” query reduces the amount of data ingested by the database tier by only selecting rows where the credit card number started with 4777 or 3333. We know from the scan of the table that this whittles down the dataset by 99.7%.

$ more demo_offload_bytes1.sql

create view ingest_less as
select avg(length(CARD_NO)) c1,
avg(length(CARD_TYPE)) c2,
max(MCC + PURCHASE_AMT) c3,
max(length(PURCHASE_DT)) c4,
max(MERCHANT_CODE) c5,
avg(length(MERCHANT_CITY)) c6,
avg(length(MERCHANT_STATE)) c7,
max(MERCHANT_ZIP) c8 from all_card_trans act
where card_no like '4777%' or card_no like '3333%';

col MB format 99999999.9
select NAME,VALUE/(1024*1024*1024) GB from gv$sysstat where STATISTIC# in (196, 44, 43 );

set timing on
select c1+c2+c3+c4+c5+c6+c7 from ingest_less;

select NAME,VALUE/(1024*1024*1024) GB from gv$sysstat where STATISTIC# in (196, 44, 43);

The following text box shows that Exadata scanned the table and filtered out the uninteresting rows. The projected columns accounted for only 1.7 GB across the iDB intelligent I/O fabric. That is, the database tier only had to ingest 1.7 GB, or 17 MB/s, since the query completed in 103 seconds.

SQL> @demo_offload_bytes_1

View dropped.

Elapsed: 00:00:00.21

View created.

Elapsed: 00:00:00.00

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           1841.13743
cell physical IO interconnect bytes                              43.1782733
cell physical IO bytes eligible for predicate offload            1840.97177

Elapsed: 00:00:00.01

C1+C2+C3+C4+C5+C6+C7
--------------------
           9010106.9

Elapsed: 00:01:43.31

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           2454.80203
cell physical IO interconnect bytes                              44.8776852
cell physical IO bytes eligible for predicate offload            2454.62903

Elapsed: 00:00:00.00
SQL>

So, yes, Exadata does let you see how effectively Smart Scans are working. In this example we saw a single-table scan on a small Oracle Exadata Storage Server grid weeding out 99.7% of the data.

Sneaky, Sneaky.

You’ll notice when I executed demo_offload_bytes.sql (the non-filtering example) I was not tracking query completion time. That’s because up to this point I’ve only been showing examples driven by a single Real Application Clusters node. Now, let’s think about this. I have 8 Intel processor cores and a single inbound Infiniband path on the database server. From an inbound I/O bandwidth perspective the host can “only” ingest 1.6GB/s, but the CPUs may also further throttle that since the database is doing its own projection in this case.

I’ve shown, in this post, that this test Exadata Storage grid configuration can scan disk at 6 GB/s. The question you might ask-and I’m about to answer-is how much does this single database host throttle storage and why does that matter? Well, it matters because having ample storage bandwidth with limited database server bandwidth is the classic imbalance addressed by the HP Oracle Database Machine. So, let me run it again-with timing turned on. As you’ll see, bottlenecks are bottlenecks. This single database host cannot keep up with storage, so storage is throttled and the result is an 18X increase in query completion time when compared to the heavily filtered case. Both queries had to read the same amount of data, but in this case there was an imbalance in the upstream ability to ingest the data, both from an I/O bandwidth and a CPU perspective.

SQL> @demo_offload_bytes

View dropped.

Elapsed: 00:00:00.02

View created.

Elapsed: 00:00:00.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           2454.81084
cell physical IO interconnect bytes                               44.886501
cell physical IO bytes eligible for predicate offload            2454.62903

Elapsed: 00:00:00.00

C1+C2+C3+C4+C5+C6+C7
--------------------
          9010106.59

Elapsed: 00:29:26.85

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           3068.55792
cell physical IO interconnect bytes                              620.492259
cell physical IO bytes eligible for predicate offload            3068.28629

Elapsed: 00:00:00.00
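
Working the deltas from the before and after snapshots above gives a back-of-the-envelope picture of the throttling (numbers rounded):

\[
\frac{(3068.6 - 2454.8)\ \text{GB disk}}{1766.9\ \text{s}} \approx 0.35\ \text{GB/s scanned},\qquad
\frac{(620.5 - 44.9)\ \text{GB interconnect}}{1766.9\ \text{s}} \approx 0.33\ \text{GB/s ingested}
\]

Both rates are a small fraction of the roughly 6 GB/s this storage grid can deliver, which is exactly the imbalance I just described.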

Need More RDBMS Bandwidth? OK, Use It.

So, let’s see what the full-ingestion case looks like with 4 Real Application Clusters nodes. And please forgive that I forgot to aggregate the v$sysstat output. I’ve added 4-fold database grid resources so I should complete the full-ingestion query in 75% less time. The text box starts out showing that the number of RAC instances was increased from 1 to 4.

SQL> host date
Fri Oct  3 16:27:52 PDT 2008

SQL> select instance_name from gv$instance;

INSTANCE_NAME
----------------
test1

Elapsed: 00:00:00.02
SQL> select instance_name from gv$instance;

INSTANCE_NAME
----------------
test1
test4
test3
test2

Elapsed: 00:00:00.08
SQL> host date
Fri Oct  3 16:32:06 PDT 2008
SQL>
SQL> @demo_offload_bytes

View dropped.

Elapsed: 00:00:00.03

View created.

Elapsed: 00:00:00.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           3068.69843
cell physical IO interconnect bytes                              620.632769
cell physical IO bytes eligible for predicate offload            3068.28629
physical IO disk bytes                                           .013385296
cell physical IO interconnect bytes                              .013385296
cell physical IO bytes eligible for predicate offload                     0
physical IO disk bytes                                           .010355473
cell physical IO interconnect bytes                              .010355473
cell physical IO bytes eligible for predicate offload                     0
physical IO disk bytes                                           .031275272
cell physical IO interconnect bytes                              .031275272

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload                     0

12 rows selected.

Elapsed: 00:00:00.02

C1+C2+C3+C4+C5+C6+C7
--------------------
          9010106.59
Elapsed: 00:07:25.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           3222.13893
cell physical IO interconnect bytes                              764.535856
cell physical IO bytes eligible for predicate offload            3221.70425
physical IO disk bytes                                            154.31714
cell physical IO interconnect bytes                              144.732708
cell physical IO bytes eligible for predicate offload            154.280945
physical IO disk bytes                                           152.860648
cell physical IO interconnect bytes                              143.367385
cell physical IO bytes eligible for predicate offload            152.828033
physical IO disk bytes                                           153.195674
cell physical IO interconnect bytes                                143.6831

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload             153.13031

12 rows selected.

Elapsed: 00:00:00.02
SQL>

Well, that was predictable, but cool nonetheless. What about the filtered ingest-query? Should there be much of a speedup given the fact that storage bandwidth remains constant and the ingest rate of the filtered query was only roughly 17 MB/s with a single Real Application Clusters node? I could save the cut and paste effort and just tell you that adding Real Application Clusters nodes in this case will, of course, not reduce the query time, but since I’ve bored you this long, here it is:

SQL> @demo_offload_bytes_1

View dropped.

Elapsed: 00:00:00.01

View created.

Elapsed: 00:00:00.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           3222.15758
cell physical IO interconnect bytes                              764.554509
cell physical IO bytes eligible for predicate offload            3221.70425
physical IO disk bytes                                           154.334342
cell physical IO interconnect bytes                               144.74991
cell physical IO bytes eligible for predicate offload            154.280945
physical IO disk bytes                                           153.213285
cell physical IO interconnect bytes                               143.70071
cell physical IO bytes eligible for predicate offload             153.13031
physical IO disk bytes                                           152.878347
cell physical IO interconnect bytes                              143.385084

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload            152.828033

12 rows selected.

Elapsed: 00:00:00.01

C1+C2+C3+C4+C5+C6+C7
--------------------
           9010106.9

Elapsed: 00:01:42.79

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           3375.27835
cell physical IO interconnect bytes                              764.974159
cell physical IO bytes eligible for predicate offload            3374.81995
physical IO disk bytes                                             308.0746
cell physical IO interconnect bytes                              144.133529
cell physical IO bytes eligible for predicate offload            307.986572
physical IO disk bytes                                           310.634206
cell physical IO interconnect bytes                              145.189768
cell physical IO bytes eligible for predicate offload              310.5755
physical IO disk bytes                                           302.274069
cell physical IO interconnect bytes                              143.805239

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload            302.218781

12 rows selected.

Elapsed: 00:00:00.01
SQL> host date
Fri Oct  3 16:49:32 PDT 2008

Summary

Exadata works and has statistics so that the solution doesn’t feel so much like a black box. In the end, however, none of the statistics stuff really matters. What matters is whether your queries complete in the desired service times.

Podcast: Pythian Group Oracle Exadata Storage Server Q&A with Kevin Closson.

The Pythian Group recently conducted a short podcast interview with me to discuss a few topics related to Oracle Exadata Storage Server and the HP Oracle Database Machine. Sadly, the audio is troublesome in some segments, due to telephony issues, but in general I think you may find it informative. Here is the link:

The Pythian Group Interview: Kevin Closson on the Oracle Exadata Storage Server. Part I.

Oracle Exadata Storage Server FAQ – Part V. Sweet and Sour Disk.

This is installment number five in my series on Oracle Exadata Storage Server and HP Oracle Database Machine frequently asked questions. I recommend you also visit The Index of my other Exadata posts. I’m mostly cutting and pasting questions from the comment threads of my blog posts about Oracle Exadata Storage Server and the HP Oracle Database Machine and mixing in some assertions I’ve seen on the web (and re-phrasing them as questions). If they read as questions when I see them I cut and paste them without modification.

Q. […] for joining large datasets, such as a fact table and a very large dimension, we currently encourage the use of equipartitioning on the join key to reduce messaging (CPU) and the volume of data distributed to PQ slaves (memory/storage). Would there be (a) a benefit and (b) a mechanism for ensuring that matching partitions of commonly joined tables are colocated on the same cell, so that the join can be performed entirely at the storage level?

A. The answer to both (a) and (b) is no. There is no way to co-locate table or indexes on a cell given the fact that we use ASM to stripe and mirror between cells. To do this we would have to replicate partitions (in entirety) across cells-for redundancy. ASM is unaware of what type of data is stored in blocks. Besides, the only type of “join” we do at this stage is based on bloom filters. Bloom filters are created in the database tier (grid) and then pushed down to storage.

Q. Where can I find the official documentation for the Exadata Product Family? I could only find datasheet and white paper.

A. Documentation is only provided to licensed customers.

Q. How [does] the optimizer decide whether to use an index or a Smart Scan for predicate filtering? I guess he will choose the faster one, but what’s faster and when?

A. Ah, but we don’t call it Smart Table Scan. Exadata Storage Server uses Smart Scans on indexes with the same brute force and intelligence (e.g., filtration) as it does with tables.

Q. Why use the outer part of the disk for everything? I.e., why not use the slower part for mirroring?

A. This question is in reference to the fact that we recommend folks allocate the outer portions of their disks to disk groups that will contain “hot” data. Strictly speaking, there is a way to do this with Automatic Storage Management Failure Groups. I tend not to use Failure Groups to address what the reader is asking because the origin of that feature was to equip administrators with the necessary tools to do the hard work of ensuring ASM mirrors between separate controllers and/or cabinets, etc. Exadata is much more aware of the underlying storage so this level of admin effort is not needed.

Automatic Storage Management mirrors the contents of the logical management unit (a disk group) and does not mirror between disk groups unless you have Failure Groups. If we supported a RAID 0+1 (with BCL) style approach between disk groups then I would put a hot-mirror-side disk group on the sweet sectors and a cold-mirror-side disk group closer to the spindles and only read the cold-mirror-side in the event of a failure. But, we don’t do that with singleton disk groups. Instead, we do a quasi-RAID 1+0 **within** the disk group. As such we consume sweet disk for both the primary and secondary extents. In the event of a loss, I’d say our current approach is better because the application will continue to be serviced with I/O from sweet sectors. On the other hand, if there is never a loss we are consuming sweet disk for naught. It is a trade-off.

The other problem with a RAID 0+1 (hot-to-cold) striped-then-mirrored approach is suffered by OLTP workloads because with the typical read/write ratio of OLTP we’d be wildly flailing between the sweet and sour sectors to satisfy the writes. Remember, we are not a one-pony show.

Oracle Exadata Storage Server FAQ – Part IV. A Peek at Storage Filtration and Column Projection.

This is installment number four in my series on Oracle Exadata Storage Server and HP Oracle Database Machine frequently asked questions. I recommend you also visit:

Oracle Exadata Storage Server Frequently Asked Questions Part I.

Oracle Exadata Storage Server Frequently Asked Questions Part II.

Oracle Exadata Storage Server. Frequently Asked Questions. Part III.

I’m mostly cutting and pasting questions from the comment threads of my blog posts about Exadata and mixing in some assertions I’ve seen on the web and re-phrasing them as questions. If they read as questions when I see them, I cut and paste them without modification.

Q. What is meant in the Exadata white paper about Smart Scan Predicate Filtering by “…only rows where the employees hire date is after the specified date are sent from Exadata to the database instance..”? Does it really only return the rows matching the predicate or does it return all blocks containing rows which match the predicate? If the former is correct, how is this handled in the db block buffer?

A. Actually, that statement is a bit too liberal. Oracle Exadata Storage Server Smart Scans return only the cited columns from only the filtered rows, not entire rows as that statement may suggest. Results from a Smart Scan are not “real” database blocks thus the results do not get cached in the SGA. Consider the following business question:

List our Club Card members that have spent more than $1,000.00 at non-partner retail merchants in the last 180 days with our affinity credit card. Consider only non-partner merchants within 10 miles radius of one of our stores.

The SQL statement in the following text box answers this question (assuming 10-mile wide US postal code zones):

select cf.custid, sum(act.purchase_amt) sales
from all_card_trans act, cust_fact cf
where ( act.card_no like '4777%' or act.card_no like '3333%' )
and act.card_no = cf.aff_cc_num and cf.club_card_num not like '0%'
and act.purchase_dt > to_date('07-MAR-2008','dd-mon-yyyy')
and act.merchant_zip in ( select distinct(zip) from our_stores)
and act.merchant_code not in (select merchant_code from partner_merchants)
group by custid having sum(act.purchase_amt) > 1000 ;

Intelligent Storage. Filtering and Projecting. Oh, and a Bloom Filter Too.

Now, let’s look at the predicate handling in the plan. The next text box shows that storage is filtering based on club cards, purchase dates and credit card numbers. Given the way this query functions, all_card_trans is essentially like a fact table and cust_fact is more of a dimension table since there are 1,300-fold more rows in all_card_trans than in cust_fact. You’ll also see that a bloom filter was pushed to the Exadata Storage Server cells and used to join the filtered cust_fact rows to the filtered all_card_trans rows.

Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter(SUM("ACT"."PURCHASE_AMT")>1000)
7 - access("ACT"."MERCHANT_CODE"="MERCHANT_CODE")
10 - access("ACT"."MERCHANT_ZIP"="ZIP")
15 - access("ACT"."CARD_NO"="CF"."AFF_CC_NUM")
20 - storage("CF"."CLUB_CARD_NUM" NOT LIKE '0%')
filter("CF"."CLUB_CARD_NUM" NOT LIKE '0%')
25 - storage("ACT"."PURCHASE_DT">TO_DATE(' 2008-03-07 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND ("ACT"."CARD_NO" LIKE '4777%' OR
"ACT"."CARD_NO" LIKE '3333%'))
filter("ACT"."PURCHASE_DT">TO_DATE(' 2008-03-07 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND ("ACT"."CARD_NO" LIKE '4777%' OR
"ACT"."CARD_NO" LIKE '3333%') AND SYS_OP_BLOOM_FILTER(:BF0000,"ACT"."CARD_NO"))

This example should make the answer a bit clearer. Storage is ripping through the rows, yes, but only returning named columns and heavily filtered data.

The following text box shows frozen output of a tool we used to monitor I/O at a high level across cells. The names of the Exadata Storage cells in this case are sgbeta1s01-sgbeta1s06 (I’m like you, I have to scrounge for hardware). These are two 3-second averages showing I/O in excess of 1,000,000 KB per cell, or an aggregate of 6 GB/s (see the data under the “bi” column). This is, of course, what we’ve been referring to as a “half-rack” in spite of the fact that it only has 6 Exadata Storage Server cells instead of the 7 that would be a true half-rack. Nonetheless, this is a 4-table query doing real work and, as you can see in the last text box, I’m connected to a single instance of RAC-on a dual socket/quad-core server! Without Exadata, a server would have to have 15 fibre channel host bus adaptors to drive this query at this I/O rate and a great deal of CPU power. That would be a very large server-and it would still be slower.

Kids, don’t try this at home without supervision

procs -----------memory---------- ---swap-- -----io----  --system-- -----cpu------
15:29:43:   r  b     swpd    free   buff     cache   si   so        bi    bo    in      cs us sy id wa st
sgbeta1s01: 4    0    204   63292 134928 1523732    0      0 1033237    24  5062 37992 25  3 72  0  0
sgbeta1s02: 8    0    204   65708 141800 1596124    0      0 1038949     9  5238 37869 25  3 72  0  0
sgbeta1s03: 4    0    208 1183780  57832    584988    0    0 1049712    35  5862 40258 26  3 69    2  0
sgbeta1s04: 2    0    204 1254456  79084    568052    0    0 1053093    13  5329 38462 25  3 72    0  0
sgbeta1s05: 4    0      0  144924 122252 1453324    0      0 1022032    29  4991 38164 25  3 72  0  0
sgbeta1s06:11    0      0   95808 124200 1418236    0      0 1000635    24  4767 38249 25  2 73  0  0
Minimum:   2  0      0   63292  57832    568052    0    0 1000635     9  4767 37869 25  2 69    0  0
Maximum:11  0    208 1254456 141800 1596124    0      0 1053093    35  5862 40258 26  3 73  2  0
Average:   5  0    136  467994 110016   1190742    0    0 1032943    22  5208 38499 25  2 71  0  0
procs -----------memory---------- ---swap-- -----io----  --system-- -----cpu------
15:29:46:   r  b     swpd    free   buff     cache   si   so        bi    bo    in      cs us sy id wa st
sgbeta1s01: 3    0    204   63012 134928 1523732    0      0 1005643    24  5099 38475 25  2 72  0  0
sgbeta1s02: 2    0    204   65188 141800 1596124    0      0 1011285     3  4867 37671 25  2 73  0  0
sgbeta1s03: 4    0    208 1184408  57832    584988    0    0 1005371    20  5775 39378 25  3 72    0  0
sgbeta1s04: 3    0    204 1255200  79084    568052    0    0    993264    13  4891 38240 24  3 73  0  0
sgbeta1s05: 5    0      0  144320 122256 1453324    0      0 1016859    29  4947 38073 25  3 72  0  0
sgbeta1s06: 6    0      0   95824 124200 1418236    0    0 1019323    24  4961 38224 25  2 73  0  0
Minimum:   2  0      0   63012  57832    568052    0    0    993264     3  4867 37671 24  2 72  0  0
Maximum:   6  0    208 1255200 141800 1596124      0    0 1019323    29    5775 39378 25  3 73  0    0
Average:   3  0    136  467992 110016   1190742    0    0 1008624    18  5090 38343 24  2 72  0  0
[oracle@sgbeta1c01]$ sqlplus '/ as sysdba'

SQL*Plus: Release 11.1.0.7.0 - Production on Thu Oct 2 15:35:55 2008
Copyright (c) 1982, 2008, Oracle.  All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release   11.1.0.7.0 - 64bit Production
With the Partitioning, Real Application Clusters,   OLAP, Data Mining
and Real Application Testing options
SQL> select instance_name from gv$instance ;
INSTANCE_NAME
----------------
test1
SQL>

Oracle Exadata Storage Server. Frequently Asked Questions. Part III.

This is installment number three in my series on Oracle Exadata Storage Server and HP Oracle Database Machine frequently asked questions. I recommend you also visit:

Exadata Storage Server Frequently Asked Questions Part I.

Exadata Storage Server Frequently Asked Questions Part II.

I’m mostly cutting and pasting questions from the comment threads of my blog posts about Exadata and mixing in some assertions I’ve seen on the web and re-phrasing them as questions. If they read as questions when I see them, I cut and paste them without modification.

Q. Is there a coming Upgrading Guide kind of document or a step-by-step installation metalink note planned for the database machine?
A. HP Oracle Database Machine and Oracle Exadata Storage Servers are installed at the factory by HP.

Q. The ODM spec sheet says there are four 24-port InfiniBand switches (96 total ports) in each DB machine. If each of the 8 RDBMS hosts has 2 links to the switches and each Exadata server (14) also has two, then it is just 2×8 + 2×14 = 44 links
A. Since the HP Oracle Database Machine is an appliance installed at the factory, by HP, I’m hesitant to go too deep in this area. The short answer is that the extra switches are there to address the loss of an entire HP Oracle Database Machine rack in a multi-rack scale-out configuration.

Q. How does this architecture deal with data distribution and redistribution? It seems like that’s still going to be a problem with joining data that isn’t distributed the same way. Does all the data then go back to the RAC?
A. Data distribution is a multifaceted topic. There is partition-wise data distribution and ASM extent distribution. Nonetheless, the answer is the same for both types of distribution: no change. ASM treats what we refer to as “grid disks” in Exadata Storage Cells no differently than disks in a SAN when it comes to laying out extents. Likewise, partitioning does not change. In fact, nothing about partitioning changes with Exadata.

If, for instance, you have data with poor data distribution (e.g., partitioning skew) with ASM on a SAN, it would be the same with Exadata-but at least the I/O would be extremely fast <smiley>

Exadata changes how data comes from disk, not how it is placed on disk.

Q. If I do a big query and sort, will that bottleneck one of the RAC nodes?
A. Exadata changes nothing about Oracle in this regard. Nonetheless, sorts are parallelized with Intra-node Parallel Query in a Real Applications Clusters environment so I’m at a loss for what you are referring to.

Q. Is temp space managed at the storage layer or on the RAC nodes?
A. Exadata changes nothing about Oracle in this regard. Temporary segments are a logical collection of extents in a file. It’s the same with or without Exadata.

Q. Not sure this is exactly in your field, but what does the cell do when it hits a block that was flushed to disk before being committed (ie needs an UNDO check)? Can it return a mix of rowset and block data so the DB server checks UNDO ?
A. Data consistency and transaction concurrency are not offloaded to Exadata Storage Servers. The integrity of the scan is controlled by the RDBMS. I think it is counterproductive to discuss the edge-cases where a Smart Scan will not be possible. If you are using a database as a data warehouse, you will get Smart Scans. If you are doing reporting against an active OLTP database, you will see queries that are not serviced by Smart Scans.

Smart Scans are optimized for data warehousing workloads and, just as is the case in non-Exadata environments, it is not good practice to be modifying the active query data set while running queries. Adding partitions and loading data while queries are running, sure, but changing a row here and there in a data warehouse doesn’t make much sense (at least to me).

Q. Suppose I have a server (e.g. linux) or a number of RAC nodes, how do I connect them to the Exadata and how do I access the disk space?
A. If you wish to adopt Exadata into an existing Oracle Database 11.1.0.7 environment there are SKUs for that. Talk to your sales representative and make room for Infiniband switches and HCAs.

Q. Do I need fibre or ethernet connections, switches, special hardware between my server and the Exadata?
A. Of course! Exadata is Infiniband based. You’ll at least need Infiniband HCAs and switches to get to the data stored in Exadata. Once you are up to the correct Oracle version you can run with non-Exadata and Exadata tablespaces side-by-side. This fact will aid migrations.

Q. Do I still need OS multipath software?
A. No. Well, not for Exadata.

Q. Do I see raw luns that I present to an ASM instance running on my own machine(s) or does my database communicates directly with the ASM on the Exadata?
A. ASM will have visibility to Exadata Storage Server “grid disks.” There happens to be a command line tool that makes it easy for me to illustrate the point. In the following text box I’ve cut and pasted session output from an xterm where I used the kfod tool to list all known Exadata Storage Server grid disks and grep’ed for ones I named “data1” on cell number 6 of the configuration. To further illustrate the point I then changed directories to list the DDL I used to incorporate all “data1” grid disks in the configuration into an ASM disk group called “DATA1.” Other than the fact that DATA1 is a candidate for Smart Scan, there is really nothing different between this disk group and any other Oracle Database 11g ASM disk group.

$ kfod -disk all | grep 'data1.*cell06'
241: 117760 Mb o/192.168.50.32:5042/data1_CD_10_cell06 <unknown> <unknown>
242: 117760 Mb o/192.168.50.32:5042/data1_CD_11_cell06 <unknown> <unknown>
243: 117760 Mb o/192.168.50.32:5042/data1_CD_12_cell06 <unknown> <unknown>
244: 117760 Mb o/192.168.50.32:5042/data1_CD_1_cell06 <unknown> <unknown>
245: 117760 Mb o/192.168.50.32:5042/data1_CD_2_cell06 <unknown> <unknown>
246: 117760 Mb o/192.168.50.32:5042/data1_CD_3_cell06 <unknown> <unknown>
247: 117760 Mb o/192.168.50.32:5042/data1_CD_4_cell06 <unknown> <unknown>
248: 117760 Mb o/192.168.50.32:5042/data1_CD_5_cell06 <unknown> <unknown>
249: 117760 Mb o/192.168.50.32:5042/data1_CD_6_cell06 <unknown> <unknown>
250: 117760 Mb o/192.168.50.32:5042/data1_CD_7_cell06 <unknown> <unknown>
251: 117760 Mb o/192.168.50.32:5042/data1_CD_8_cell06 <unknown> <unknown>
252: 117760 Mb o/192.168.50.32:5042/data1_CD_9_cell06 <unknown> <unknown>

$ cd $ORACLE_HOME/dbs
$ cat cr_data1_dg.sql
create diskgroup DATA1 normal redundancy
DISK 'o/*/*data1*'
ATTRIBUTE
'AU_SIZE' = '4M',
'CELL.SMART_SCAN_CAPABLE'='TRUE',
'compatible.rdbms'='11.1.0.7',
'compatible.asm'='11.1.0.7'
/
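
Once the disk group is mounted it looks like any other ASM disk group from the database grid. Here is a quick sanity check, sketched against the standard V$ASM_DISKGROUP view (not output from my configuration):

-- Confirm the DATA1 disk group, its redundancy type and capacity.
select name, type, total_mb, free_mb
from   v$asm_diskgroup
where  name = 'DATA1';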

Q. Do I still have to struggle with raw devices on os level?
A. No.

Q. Can I create multiple databases in the available space?
A. Absolutely. I haven’t even started blogging about I/O Resource Management features of Exadata. This is the only platform that can prevent multiple applications from stealing resources from each other-all the way down to physical I/O.

Q. Do I still need to create asm disks or diskgroups, or do I just see one large asm disk of e.g. 168Tb?
A. Physical disks in Exadata Storage Server cells are “carved” up into what we refer to as grid disks. Each grid disk becomes an ASM disk. The fewest ASM disks you could end up with in a full-rack HP Oracle Database Machine is 168.

Q. […] don’t some of the DW vendors split the data up in a shared nothing method. Thus when the data has to be repartitioned it gets expensive. Whereas here you just add another cell and ASM goes to work in the background. (depending upon the ASM power level you set.)
A. All the DW Appliance vendors implement shared-nothing so, yes, the data is chopped up into physical partitions. If you add hardware to increase performance of queries against your current dataset the data will have to be reloaded into the new partitioning scheme. As has always been the case with ASM, adding new disks-and therefore Exadata Storage Server cells-will cause the existing data to be redistributed automatically over all (including the new) drives. This ASM data redistribution is an online function.
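
To make that concrete, here is a hedged sketch of what adding a new cell's grid disks looks like from the ASM side. The cell name (cell07), the wildcard pattern and the power level are illustrative assumptions that simply follow the data1 naming convention used earlier in this post:

-- Hypothetical example: fold a new cell's data1 grid disks into the existing
-- disk group; ASM redistributes extents online at the requested power level.
alter diskgroup DATA1
  add disk 'o/*/*data1*cell07*'
  rebalance power 8;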

Q. [regarding] Supportability – Oracle software support has always been spotty. Now with a combination of Oracle Linux, Oracle database and HP hardware, it is going to be interesting to see how it all comes together – especially upgrades, patches etc.
A. Support is provided via a single phone number.

Q. How easy or difficult is it to maintain? Do we need to build specialized skills inhouse or is it hands-off like Teradata?
A. In my reckoning, you need the same Oracle data warehousing skills you need today, plus a primer on Exadata.

Q. [regarding] Ease of use – Can I simply move an existing oracle warehouse instance to the new database machine and can use it day 1? How easy or difficult is it? Do I need to spend significant time like with a RAC instance – partitioning etc?
A. Data from an existing data warehouse will have to be physically moved into an Exadata environment. You will be either moving entirely from one environment to another (e.g., 10g on Unix to Exadata with Linux) or adding Exadata Storage to your existing environment and copying the data into Exadata storage. The former would be done in the same manner as any cross-platform migration, while the latter would require the warehouse be upgraded to Oracle Database 11g Release 11.1.0.7. Once upgraded to 11.1.0.7, and once Infiniband connectivity is sorted out, the data can then be copied with the simplicity of a CTAS operation or the like, as sketched below.
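
A minimal sketch of that copy step, with everything here (the database link old_dw, the tablespace EXA_TS and the target table name) being illustrative assumptions rather than anything from my test environment:

-- Hypothetical CTAS copy from the pre-Exadata 11.1.0.7 source into a
-- tablespace built on Exadata grid disks.
create table all_card_trans_exa
  tablespace EXA_TS
  parallel
  nologging
as
  select * from all_card_trans@old_dw;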

Q. The Exadata storage concept is excellent – more storage comes with additional CPU and Cache – Can we use it for non-oracle applications – such as Log processing etc?
A. Anything that can go into an Oracle Database can go into Exadata. So such features as SecureFiles are supported. Exadata is not scalable general-purpose storage.

Q. Why would I want to use Oracle rather than Teradata or Netezza which is proven?
A. Because, perhaps, the data you are extracting to load into Netezza is coming from an Oracle Database? There are a lot of answers to this question I suppose. In the end, I should think the choice would be based foremost on performance. Most of Netezza’s customers are either Oracle customers as well, or have migrated from Oracle. I think in Netezza’s “early days” the question was likely reversed. We aim to reverse the question.

Q. Backup using RMAN – RMAN backups are not really geared for big databases, so is there any other off host alternatives available?
A. The data stored in Exadata is under ASM control. The same backup restrictions apply for Exadata as any other ASM deployment.

Oracle Exadata Storage Server Related Web News Media and Blog Errata. Part I.

This is just a quick blog entry to correct some minor (and some not-so-minor) errors I’ve stumbled upon in blog and Web news media.

Exadata Storage Server Gross Capacity

In his recent Computerworld article, Eric Lai was trying to put some flesh on the bones as it were. It is a good article, but a few bits need correction/clarification. First, Eric wrote:

The Exadata Storage Server, a standard rack-mountable HP Proliant DL180 G5 server sporting two Intel quad-core CPUs connected to 12 hard drives of 1TB each.

The HP Oracle Database Machine has 2 hard drive options-SAS and SATA. The SAS option is comprised of 12 300GB 15,000 RPM drives. The SATA option is indeed 12 drives each 1TB in size.

HP Oracle Database Machine Communications Bandwidth

Later in the article the topic of bandwidth between the RDBMS hosts and Exadata Storage Servers was covered. The author wrote:

[…] users can expect a real-world bandwidth today of 1Gbit/sec, which he claimed is far faster than conventional disk storage arrays.

The HP Oracle Database Machine has one active and one failover Infiniband path between each of the RDBMS hosts (for RAC inter-process communication) and from each RDBMS host to each Exadata Storage Server. Each path offers 20Gb of bandwidth, which is more than the aggregate streaming disk I/O an Exadata cell can produce for up-stream delivery, as I explain in question #2 of my Exadata Storage Server FAQ. Since the disks can "only" generate roughly 1GB/s of streaming data, we routinely state that users can expect real-world bandwidth today of 1GB/second. Note the difference in notation (GB vs. Gb); it accounts for nearly an order of magnitude difference.

A Rack Half-Full or Half-Empty?

When discussing the Beta testing activity, the author quoted Larry Ellison as having said the following at his Keynote address:

A number of Oracle customers have been testing the Machine for a year, putting their actual production workloads onto half-sized Oracle Database Machines, because we’re really cheap

I was there and he did say it. While I may be dumb (actually I can talk), I am not stupid enough to “correct” Larry Ellison. Nonetheless, when Larry said the Beta participants were delivered a half-sized HP Oracle Database Machine he was actually being too generous. He said we sent a half configuration because “we’re really cheap”, but in fact we must be even cheaper because, while we sent them half the number of RDBMS hosts, we sent them 6 Exadata Storage Servers as opposed to 7, which would be exactly half a Database Machine.

Good for the Goose, Good for the Gander

Finally, on my very own blog (in this post even!) I have routinely stated the wire bandwidth of the Infiniband network with which we interconnect Oracle Database instances and Oracle Database instances to Oracle Exadata Storage Server cells as being 20Gb/s. Of course with all communications protocols there is a difference between the wire-rate and the payload bandwidth. One of my blog readers commented as follows:

Yet another minor nit ;) As I commented elsewhere, 20Gb/s is the IB baud rate, the useful bit rate is 16Gb/s (8b/10b encoding). I am not sure why the IB folks keep using the baud numbers.

And to that I replied:

Not minor nits at all. Thanks. We have used pretty poor precision when it comes to this topic. Let me offer mitigating circumstances.

While the pipe is indeed limited to 16Gb payload (20% less than our routinely stated 20Gb), that is still nearly twice the amount of data a cell can produce by streaming I/O from disk in the first place. So, shame on us for being 20% off in that regard, but kudos to us for making the pipes nearly 100% too big?
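
For the record, the arithmetic behind that claim is simple enough:

-- 20 Gb/s signalling x 0.8 (8b/10b encoding) = 16 Gb/s payload, or 2 GB/s per link,
-- versus the roughly 1 GB/s of streaming data a cell produces from disk
select 20 * 0.8 / 8 as payload_gbytes_per_sec from dual;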

HP Oracle Database Machine. A Thing of Beauty Capable of “Real Throughput!”

As they say, a blog without photographs is simply boring. Here is a picture of a single-rack HP Oracle Database Machine. It is stuffed with 8 nodes for Real Application Clusters and 14 Oracle Exadata Storage Servers with 168 3.5″ SAS hard drives. My lab work on a SAS-version of one just like this yields 13.6 GB/s throughput for table scans with offloaded filtration and column projection.

The next photo is a shot of (from the left) Mike Hallas, Greg Rahn and myself in the Moscone North Demo of the HP Oracle Database Machine. Mike and Greg are in the Oracle Real-World Performance Group. Great guys!

Real Throughput or Effective Throughput?

Mike worked (jointly with Bob Carlin) on the latest scale-out Proof of Concept that drove a multi-rack HP Oracle Database Machine to 70 GB/s scanning tables with 4.28:1 compression-or, in terms used more commonly by The Competition™, 299.6 GB/s. Of course 300 GB/s is the effective scan rate, but be aware that The Competition™ oftentimes expresses their throughput as effective throughput. I don't play that game. I'll say whether it is throughput or effective throughput. Wordy, I know, but I'm not as short-winded as The Competition™, it seems.

I’ve blogged about the Real-World Performance Group (under the esteemed Andrew Holdsworth) before. Those guys are awesome! Come to think of it, I have to bestow the “A” word on the MAA team as well in spite of the fact that Mike Nowak was “too busy” to catch a beer with me during the entire OW week. That’s weak! 🙂

Oracle Exadata Storage Server. Frequently Asked Questions. Part II

This is installment number two in my series on Oracle Exadata Storage Server and HP Oracle Database Machine frequently asked questions. I recommend you also visit Exadata Storage Server Frequently Asked Questions Part I. I’m mostly cutting and pasting questions from the comment threads of my blog posts about Exadata and mixing in some assertions I’ve seen on the web and re-wording them as questions.

Later today The Pythian Group will be conducting a podcast question and answer interview with me that will be available on their website shortly thereafter. I’ll post a link to that when it is available.

Questions and Answers

Q. [I’m] willing to bet this is a full-blown Oracle instance running on each Exabyte [sic] Storage Server.

A. No, bad bet. Exadata Storage Server Software is not an Oracle Database instance. I happened to have an xterm with a shell process sitting in a directory with the self-extracting binary distribution file in it. We can tell by the size of the file that there is no room for a full-blown Oracle Database distribution:

$ ls -l cell*

-rwxr-xr-x 1 root root 206411729 Sep 12 22:04 cell-080905-1-rpm.bin

Q. This must certainly be a difficult product to install, right?

A. HP installs the software on their manufacturing floor. Nonetheless I’ll point out that installing Oracle Exadata Storage Server Software is a single execution of the binary distribution file without options or arguments. Further, initializing a Cell is a single command with two options of which only one requires an argument; such as the following example where I specify a bonded Infiniband interface for interconnect 1:

# cellcli
CellCLI: Release 11.1.3.0.0 - Production on Fri Sep 26 10:56:17 PDT 2008
Copyright (c) 2007, 2008, Oracle. All rights reserved.
Cell Efficiency Ratio: 10,956.8

CellCLI> create cell cell01 interconnect1=bond0

After this command completes I’ve got a valid cell. There are no preparatory commands (e.g., disk partitioning, volume management, etc).

Q. I’m trying to grasp whether this is really just being pitched at the BI and data warehouse space, or whether it has real value in the OLTP space as well.

A. Oracle Exadata Storage Server is the best block server for Oracle Database, bar none. That being said, in the current release, Exadata Storage Server is in fact optimized for DW/BI workloads, not OLTP.

Q. I know we shouldn’t set too much store in these things, but are there plans to submit TPC benchmarks?

A. You are right that we shouldn't place too much stock in TPC benchmarks, but they are a necessary evil. I don't work in that space, but could you imagine Oracle not doing some audited benchmarks? Seems unlikely to me.

On the topic of TPC benchmarks, I was taking a gander at the latest move in the TPC-C "arms race." This IBM p595 result of 6,085,166 TpmC proves that the TPC-C is not (and has never been) an I/O efficiency benchmark. If you throw more gear at it, you get a bigger number! Great!

How about a stroll down memory lane.

When I was in Database Engineering in Sequent Computer Systems back in 1998, Sequent published a world-record Oracle TPC-C result on our NUMA system. We achieved 93,901 TpmC using 64GB main memory. The 6,085,166 IBM number I just cited used 4TB main memory. So how fulfilling do you think it must be to do that much work on a TPC-C just to prove that in 10 years nothing has changed! The Sequent result comes in at 1 TpmC per 714KB main memory and the IBM result at 1 TpmC per 705KB main memory. Now that’s what I want to do for a living! Build a system with 10,992 disk drives and tons of other kit just to beat a 10-year-old result by 1.3%. Yes, we are now totally convinced that if you throw more memory at the workload you get a bigger number! In the words of Gomer Pyle, “Soo-prise, Soo-prise, Soo-prise.” Ok, enough of that, I don’t like arms-race benchmarks.

Q. From the Oracle Exadata white paper: "No cell-to-cell communication is ever done or required in an Exadata configuration." And a few paragraphs later: "Data is mirrored across cells to ensure that the failure of a cell will not cause loss of data, or inhibit data accessibility." Can both these statements be true, and would we need to purchase a minimum of two cells for a small-ish ASM environment?

A. Cells are entirely autonomous, and both statements are indeed true. Consider two ASM disks out in a Fibre Channel SAN. Of course we know those two disks are not "aware of each other" just because ASM is using blocks from each to perform mirroring. The same is true for Oracle Exadata Storage Server cells and the drives housed inside them. As for the second part of the question, yes, you must have a minimum of two cells. In spite of the fact that Cells are shared-nothing (unaware of each other), ASM is in fact Cell-aware. ASM is intelligent enough not to mirror between two drives in the same Cell.
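
If you'd like to see that for yourself, a minimal sketch (assuming a disk group named DATA1) is to ask ASM how it grouped the disks into failure groups-each cell appears as its own failure group, which is why mirrored copies never land in the same cell:

select d.failgroup, count(*) disks
from v$asm_disk d, v$asm_diskgroup g
where d.group_number = g.group_number
and g.name = 'DATA1'
group by d.failgroup;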


Q. Can this secret sauce help with write speeds?

A. That depends. If you have a workload suffering from the loss of processor cycles associated with standard Unix/Linux I/O libraries then, sure. If you have an application that uses storage provisioned from an overburdened back-end Fibre Channel disk loop (due to application collusion) then, sure. Strictly speaking, the "secret sauce" is the Oracle Exadata Storage Server Software and it does not have any features for write acceleration. Any benefit would have to come from the fact that the I/O pipes are ridiculously fast, the I/O protocol is ridiculously lightweight, and the system as a whole is naturally balanced. I'll blog about the I/O Resource Management (IORM) feature of Exadata soon, as I feel it has positive attributes that will help OLTP applications. Although it is not an acceleration feature, it eliminates situations where applications steal storage bandwidth from each other.

Q. I like your initial overview of the product, but I believe that you need to compare both Netezza and Exadata side by side in real-world scenarios to gauge their performance.

A. I partially agree. I cannot go and buy a Netezza and legally produce competitive benchmark results based on the gear. Just read any EULA for any storage management software and you’ll see the bold print. Now that doesn’t mean Oracle’s competitors don’t do that. I think the comparison will come in the form of reduced Netezza sales. Heaven knows the 16% drop in Netezza stock was not as brutal as I expected.

Q. Re. [your] comparison to Netezza [in your first Exadata related post]. It’s bit of apple to oranges, really. You assume 80MB/s per disk for Exadata and for some reason only 70MB/s per disk for Netezza. Also, you have 168 disks spinning in parallel on Exadata and 112 on Netezza. Had your assumptions been tha same, sequential IO throughput would be similar, at least theoretically.

A. Reader, I invite you to explain to us how you think native SATA 7,200 RPM disk drives are going to match 15K RPM SAS drives. When I put 70 MB/s into the equation I was giving quite a benefit of the doubt (as if I've never measured SATA performance). Please, if you have a Netezza, let us know how much streaming I/O you get from a 7,200 RPM SATA drive once you read beyond the first few outside sectors. I have also been using the more conservative 80 MB/s for our SAS drives. I'm highballing SATA and low-balling SAS. That sounds fair to me. As for the comparison between the numbers of drives, well, Netezza packaging limits the drive (SPU) count to 112 per cabinet. It would suit me fine if it takes a rack plus another half rack to match a single HP Oracle Database Machine. That empty half of the rack would be annoying from a space-constraint point of view though. Nonetheless, if you did go with a rack and a half (168 SPUs), would that somehow cancel out the base difference in drive performance between SATA and SAS?

Oracle Exadata Storage Server Version 1. A “FAQ” is Born. Part I.

BLOG UPDATE (22-MAR-10): Readers, please be aware that this blog entry is about the HP Oracle Database Machine (V1).

BLOG UPDATE (01-JUN-09). According to my blog statistics, a good number of new readers find my blog by being referred to this page by google. I’d like to draw new readers’ attention to the sidebar at the right where there are pages dedicated to indexing my Exadata-related posts.  The original blog post follows:

I expected Oracle Exadata Storage Server to make an instant splash, but the blogosphere has really taken off like a rocket with the topic. Unfortunately there is already quite a bit of misinformation out there. I’d like to approach this with routine quasi-frequently asked question posts. When I find misinformation, I’ll make a blog update. So consider this installment number one.

Q. What does the word programmable mean in the product name Exadata Programmable Storage Server?

A. I don’t know, but it certainly has nothing to do with Oracle Exadata Storage Server. I have seen this moniker misapplied to the product. An Exadata Storage Server “Cell”-as we call them-is no more programmable than a Fibre Channel SAN or NAS Filer. Well, it is of course to the Exadata product development organization, but there is nothing programmable for the field. I think, perhaps, someone may have thought that Exadata is a field programmable gate array (FPGA) approach to solving the problem of offloading query intelligence to storage. Exadata is not field-“programmable” and it doesn’t use or need FPGA technology.

Q. How can Exadata be so powerful if there is only a single 1gb path from the storage cells to the switch?

A. I saw this on a blog post today and it is an incorrect assertion. In fact, I saw a blogger state, "1gb/s???? that's not that good." I couldn't agree more-if that were actually the case. This is just a common notation blunder. There is, in fact, 20 Gb bandwidth between each Cell and each host in the database grid, which is close to 2 gigabytes per second of bandwidth (a maximum theoretical 1850MB/s due to the IB cards, though). I should point out that none of the physical plumbing is "secret-sauce." Exadata leverages commodity components and open standards (e.g., OFED).

Q. How does Exadata change the SGA caching dynamic?

A. It doesn’t. Everything that is cached today in the SGA will still be cached. Most Exadata reads are buffered in the PGA since the plan is generally a full scan. That is not to say that there is no Exadata value for indexes, because there can be. Exadata scans indexes and tables with the same I/O dynamic.

Q. This Exadata stuff must be based on NAND FLASH Solid State Disk

A. No, it isn’t and I won’t talk about futures. Exadata doesn’t really need Solid State Disk. Let’s think this one through. Large sequential read and write  speed is about the same on FLASH SSD as rotating media, but random I/O is very fast. 12 Hard Disk Drives can saturate the I/O controller so plugging SSD in where the 3.5″ HDDs are would be a waste.

Q. Why mention sequential disk I/O performance since sequential accesses will only occur in rare circumstances (e.g., non-concurrent scans).

A. Yes, and the question is what, exactly? No, honestly, I'll touch on this. Of course concurrent queries attacking the same physical disks will introduce seek times and rotational delays. And the "competition" can somehow magically scan different table extents on the same disks without causing the same drive dynamic? Of course not. If Exadata is servicing concurrent queries that attack different regions of the same drives then, yes, by all means there will be seeks. Those seeks, by the way, are followed by 4 sequential 1MB I/O operations, so the seek time is essentially amortized out.

Q. Is Exadata I/O really sequential, ever?

A. I get this one a lot and it generally comes from folks who know Automatic Storage Management (ASM). Exadata leverages ASM normal redundancy mirroring, which mirrors and stripes the data. Oh my, doesn't that entail textbook random I/O? No, not really. ASM will "fill" a disk from the "outside-in." This does not create a totally random I/O pattern since the placement doesn't randomize from the outer edge of the platters to the spindle and back. In general, the "next" read on any given disk involved in a scan will be at a greater offset in the physical device and not that "far" from the previously read sectors. This does not create the pathological seek times that would be associated with a true random I/O profile.

When Exadata is scanning a disk that is part of an ASM normal redundancy disk group and needs to “advance forward” to get the next portion of the table, Exadata directs the drive mechanics to position at the specific offset where it will read an ASM allocation unit of data, and on and on it goes. Head movements of this variety are considered “short seeks.” I know what the competition says about this topic in their positioning papers. Misinformation will be propagated.

Let me see if I can handle this topic in a different manner. If HP Oracle Exadata Storage Server was a totally random I/O train wreck then it wouldn’t likely be able to drive all the disks in the system at ~85MB/s. In the end, I personally think the demonstrated throughput is more interesting than an academic argument one might stumble upon in an anti-Exadata positioning paper.

Well, I think I’ll wrap this up as installment one of an on-going thread of Q&A on HP Oracle Exadata Storage Server and the HP Oracle Database Machine.

Don’t forget to read Ron Weiss’ Oracle Exadata Storage Server Technical Product Whitepaper. Ron is a good guy and it is a very informative piece. Consider it required reading-especially if you are trolling my site in the role of competitive technical marketing. <smiley>

Oracle Exadata Storage Server. Part II.

I have to run over and man a live single-rack HP Oracle Database Machine Demonstration in Moscone North, so I thought I’d take just a moment to post some links to more official Oracle information on Oracle Exadata Storage Server:

Main Oracle Exadata Storage Server Webpage

Oracle Exadata Storage Server Product Whitepaper

I plan to start a FAQ-style series of blog posts regarding the HP Oracle Database Machine and Oracle Exadata Storage Server as soon as possible.

Yesterday was a big day for me, and the extremely talented team I work with. Having the pleasure of doing performance architecture work on Oracle’s most important new product in ages (my humble opinion) has been quite an adventure. One of the last Oracle Exadata Storage Server tasks I worked on prior to the release was a Proof of Concept for Winter Corporation. Expect the report from that work to be available in the next few days. I’ll post a link to that when it is ready.

Oracle Exadata Storage Server. Part I.

Brute Force with Brains.
Here is a brief overview of the Oracle Exadata Storage Server key performance attributes:

  • Intelligent Storage. Ship less data due to query intelligence in the storage.
  • Bigger Pipes. Infiniband with Remote Direct Memory Access. 5x Faster than Fibre Channel.
  • More Pipes. Scalable, redundant I/O Fabric.

Yes, it’s called Oracle Exadata Storage Server and it really was worth the wait. I know it is going to take a while for the message to settle in, but I would like to take my first blog post on the topic of Oracle Exadata Storage Server to reiterate the primary value propositions of the solution.

  • Exadata is fully optimized disk I/O. Full stop! For far too long, it has been too difficult to configure ample I/O bandwidth for Oracle, and far too difficult to configure storage so that the physical disk accesses are sequential.
  • Exadata is intelligent storage. For far too long, Oracle Database has had to ingest full blocks of data from disk for query processing, wasting precious host processor cycles to discard the uninteresting data (predicate filtering and column projection).

Oracle Exadata Storage Server is Brute Force. A Brawny solution.
A single rack of the HP Oracle Database Machine (based on Oracle Exadata Storage Server Software) is configured with 14 Oracle Exadata Storage Server “Cells” each with 12 3.5″ hard drives for a total of 168 disks. There are 300GB SAS and 1TB SATA options. The database tier of the single-rack HP Oracle Database Machine consists of 8 Proliant DL360 servers with 2 Xeon 54XX quad-core processors and 32 GB RAM running Oracle Real Application Clusters (RAC). The RAC nodes are interconnected with Infiniband using the very lightweight Reliable Datagram Sockets (RDS) protocol. RDS over Infiniband is also the I/O fabric between the RAC nodes and the Storage Cells. With the SAS storage option, the HP Oracle Database Machine offers roughly 1 terabyte of optimal user addressable space per Storage Cell-14 TB total.

Sequential I/O
Exadata I/O is a blend of random seeks followed by a series of large transfer requests so scanning disk at rates of nearly 85 MB/s per disk drive (1000 MB/s per Storage Cell) is easily achieved. With 14 Exadata Storage Cells, the data-scanning rate is 14 GB/s. Yes, roughly 80 seconds to scan a terabyte-and that is with the base HP Oracle Database Machine configuration. Oracle Exadata Storage Software offers these scan rates on both tables and indexes and partitioning is, of course, fully supported-as is compression.

Comparison to “Old School”
Let me put Oracle Exadata Storage Server performance into perspective by drawing a comparison to Fibre Channel SAN technology. The building block of all native Fibre Channel SAN arrays is the Fibre Channel Arbitrated Loop (FCAL) to which the disk drives are connected. Some arrays support as few as 2 of these “back-end” loops, larger arrays support as many as 64. Most, if not all, current SAN arrays support 4 Gb FCAL back-end loops which are limited to no more than 400MB/s of read bandwidth. The drives connected to the loops have front-end Fibre Channel electronics and-forgetting FC-SATA drives for a moment-the drives themselves are fundamentally the same as SAS drives-given the same capacity and rotational speed. It turns out that SAS and Fibre drives, of the 300GB 15K RPM variety, perform pretty much the same for large sequential I/O. Given the bandwidth of the drives, the task of building a SAN-based system that isn’t loop-bottlenecked requires limiting the number of drives per loop to 5 (or 10 for mirroring overhead). So, to match a single rack configuration of the HP Oracle Database Machine with a SAN solution would require about 35 back-end drive loops! All of this math boils down to one thing: a very, very large high-end SAN array.

Choices, Choices: Either the Largest SAN Array or the Smallest HP Oracle Database Machine
Only the largest of the high-end SAN arrays can match the base HP Oracle Database Machine I/O bandwidth. And this is provided the SAN array processors can actually pass through all the I/O generated from a full complement of back-end FCAL loops. Generally speaking, they just don’t have enough array processor bandwidth to do so.

Comparison to the “New Guys on the Block”
Well, they aren't really that new. I'm talking about Netezza. Their smallest full rack has 112 Snippet Processing Units (SPUs), each with a single SATA disk drive-plus an onboard processor and FPGA components-for a total user addressable space of 12.5 TB. If the data streamed off the SATA drives at, say, 70 MB/s, the solution offers 7.8 GB/s-42% slower than a single-rack HP Oracle Database Machine.

Big, Efficient Pipes
Oracle Exadata Storage Server delivers I/O results directly into the address space of the Oracle Database Parallel Query Option processes using the Reliable Datagram Sockets (RDS) protocol over Infiniband. As such, each of the Oracle Real Application Clusters nodes is able to ingest a little over a gigabyte of streaming data per second at a CPU cost of less than 5%, which is less than the typical cost of interfacing with Fibre Channel host-bus adaptors via traditional Unix/Linux I/O calls. With Oracle Exadata Storage Server, the Oracle Database host processing power is neither wasted on filtering out uninteresting data nor on plucking columns from the rows. There would, of course, be no need to project in a column-oriented database, but Oracle Database is still row-oriented.

Oracle Exadata Storage Server is Intelligent Storage. Brainy Software.
Oracle Exadata Storage Server truly is an optimized way to stream data to Oracle Database. However, none of the traditional Oracle Database features (e.g., partitioning, indexing, compression, Backup/Restore, Disaster Protection, etc.) are lost when deploying Exadata. Combining data elimination (via partitioning) with compression further exploits the core architectural strengths of Exadata. But what about this intelligence? Well, as we all know, queries seldom touch all the columns and few queries ever run without a WHERE predicate for filtration. With Exadata, that intelligence is offloaded to storage. Exadata Storage Cells execute intelligent software that understands how to perform filtration as well as column projection. For instance, consider a query that cites 2 columns nestled in the middle of a 100-column row, with a WHERE predicate that filters out 50% of the rows. With Exadata, only those 2 columns from the qualifying rows are returned to the Oracle Parallel Query processes.
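
A minimal way to see the offload in action is to eyeball the cell-related statistics before and after running a query (a sketch; exact statistic names vary by release, so I am just pattern-matching):

select name, value
from v$sysstat
where name like 'cell%'
order by name;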

By this time it should start to make sense why I have blogged in the past the way I do about SAN technology, such as this post about SAN disk/array bottlenecking.  Configuring a high-bandwidth SAN requires a lot of care.

Yes, this is a very short, technically-light blog entry about Oracle Exadata Storage Server, but this is day one. I didn’t touch on any of the other really exciting things Exadata does in the areas of I/O Resource Management, offloaded online backup and offloaded join filters, but I will.

Oracle OpenWorld Bound

Where’s Waldo?

As infrequently as I’ve posted over the last few months I’m sort of surprised I even have any readers remaining!

I will be at OpenWorld and I’d love to meet up with as many of you as I can. I’ll be working the Oracle Demo Ground in Moscone North on late Wednesday afternoon and Thursday morning. Until that point I’ll be attending sessions and catching up with a lot of folks I only get to see at the show these days. Feel free to send me an email at the address listed in my contact section of the blog.

If you are one of the people who like, or dislike, my positions on Fibre Channel SANs (i.e., the "Manly Man Series"), or want to talk more about why most Oracle shops aren't realizing full hard drive bandwidth, then send me a note and we'll see if we can chat.

I’m really looking forward to the show this year. There seems to be significant buzz about the show, as this ComputerWorld.com article will attest.

Don’t forget to stop by the official OpenWorld Blog.

I Know Nothing About Data Warehouse Appliances and Now, So Won’t You – Part IV. Microsoft takes over DATAllegro.

It looks like my blog entries about DATAllegro (such as this piece about DATAllegro and magic 4GFC throughput) are going to start to sound a wee bit different:

Microsoft buys DATAllegro

Oracle Database 10g 10.2.0.4 Cannot Boot a Large SGA on AMD Servers Running Linux

In the comment thread of my recent blog entry entitled Of Gag-Orders, Excitement, and New Products, a fellow blogger, Jeff Hunter wrote:

I’d be happy if the major innovation was being able to run a 10.2.0.4 16G SGA on x86_64.

He offered a link to a thread on his blog where he has been chronicling his unsuccessful attempts to boot a 16GB SGA on the same iron that seemed to have no problem doing so with 10.2.0.3.

What’s New?

Oracle Database 10g release 10.2.0.4 has additional rudimentary support for NUMA in the Linux port, true, but Jeff has tried with NUMA both enabled and disabled (via boot options), none of which has fixed his problem. In his latest installment on this thread I noticed that the post has been retitled "The Great NUMA debate," and it ends with Jeff reporting that he still is having trouble with his 16GB SGA, and also that he can't boot even a 4GB SGA. Jeff wrote:

I still couldn’t start a 16GB SGA. Interestingly enough, I couldn’t start a 4G SGA either! I had to go back to booting without numa=off. The saga continues…

Unfortunately, I can't jump in and debug what is wrong on his configuration and I don't know what the debate is. However, I can take a moment to post evidence that Oracle Database 10g 10.2.0.4 can in fact boot a 16GB SGA-in both AMD Opteron SUMA mode and NUMA mode. No, I don't have any large-memory AMD systems around to test this myself. But I certainly used to. So, I decided to call in a favor from my old friend Mary Meredith (yes, old Sequent folks stick together) who has taken over the role I vacated at HP/PolyServe when I left to join Oracle. I asked Mary if she'd mind booting a 16GB SGA on one of those large-memory AMD systems I used to have available to me…and she did:

$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.4.0 - Production on Mon Jul 6 09:15:35 2008
Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.
Connected to an idle instance.
SQL> startup pfile=create1.ora
ORACLE instance started.
Total System Global Area 1.7700E+10 bytes
Fixed Size                  2115104 bytes
Variable Size             503319008 bytes
Database Buffers         1.7180E+10 bytes
Redo Buffers               14659584 bytes
Database mounted.
Database opened.

$ numactl --hardware
available: 1 nodes (0-0)
node 0 size: 32146 MB
node 0 free: 13821 MB
node distances:
node   0
  0:  10

So, here we see 10.2.0.4 on a SUMA-configured Proliant DL585 with a 16GB buffer pool. I asked Mary if she’d be willing to boot in NUMA mode (Linux boot option) and give it a try, and she did:

$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.4.0 - Production on Mon Jul 7 10:03:35 2008
Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.
Connected to an idle instance.
SQL> startup pfile=create1.ora
ORACLE instance started.
Total System Global Area 1.7700E+10 bytes
Fixed Size                  2115104 bytes
Variable Size             503319008 bytes
Database Buffers         1.7180E+10 bytes
Redo Buffers               14659584 bytes
Database mounted.
Database opened.
SQL> quit

But she reported that she didn’t get any hugepages:

$ cat /proc/meminfo|grep Huge
HugePages_Total:  8182
HugePages_Free:   8182
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

I pointed out that 8182 2MB hugepages is not big enough. I recommended she up that to 8500 and then start the database up under strace so we could capture the shmget() call to ensure it was flagging in SHM_HUGETLB, and it was:

$ cat /proc/meminfo|grep Huge
HugePages_Total:  8500
HugePages_Free:   7132
HugePages_Rsvd:   7073
Hugepagesize:     2048 kB

And from the strace:

6510  shmget(0x1420f290, 17702060032, IPC_CREAT|IPC_EXCL|SHM_HUGETLB|0600) = 393219

And…

$ ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0x1420f290 393219     oracle    600        17702060032 12

Also, in the NUMA configuration we see a good, even distribution of pages allocated from each of the “nodes”, with the exception of node zero which until Linux gets fully NUMA-aware will always be over-consumed:

$ numactl --hardware
available: 4 nodes (0-3)
node 0 size: 7906 MB
node 0 free: 2025 MB
node 1 size: 8080 MB
node 1 free: 3920 MB
node 2 size: 8080 MB
node 2 free: 3969 MB
node 3 size: 8080 MB
node 3 free: 3926 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

We also see that the shmget() call did flag in SHM_HUGETLB and correspondingly we see the shmkey in the ipcs output. We also see hugepages being used, although mostly just reserved.
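
For the curious, the sizing arithmetic falls straight out of the shmget() call above: a 17,702,060,032-byte SGA divided into 2MB hugepages needs 8,441 of them, which is why the originally configured pool was not enough and 8500 leaves a little headroom.

-- 17702060032 bytes / 2097152 bytes per hugepage = 8441 hugepages
select ceil(17702060032 / (2*1024*1024)) as hugepages_needed from dual;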

So, I haven’t been able to see Jeff’s strace output or other such diagnostic information so I can’t help there. However, this blog post is meant to be a confidence booster to any wayward googler who might happen to be having difficulty booting a VLM SGA on AMD Opteron running Linux with Oracle Database 10g release 10.2.0.4.

Extra Credit

So, if Mary had booted in NUMA mode without hugepages, does anyone think it would have resulted in such a nice, even consumption of pages from the nodes, or would it have looked like Cyclops? We all recall Cyclops, don't we? In case you don't, here is a link:
Oracle on Opteron with Linux–The NUMA Angle Part VI. Introducing Cyclops.



