I’ve met people before who’d rather drink muriatic acid than admit a mistake. I don’t cotton to such folks none too purdy good. This is a post about perfection and imperfection.
Cost Based Optimizer
There is no such thing as a perfect cost-based optimizer (CBO), yet the expectations placed on cost-based optimizers run rampant. If the Oracle Database 11g CBO were perfect, there would be no need for an excellent reactive tuning tool like SQL Tuning Advisor. In those cases where the CBO does not generate a perfect plan, Oracle Database equips administrators with a tuning toolset that generally gets it right.
What’s This Have to do with Oracle Exadata Storage Server?
In my blog entry entitled Oracle Exadata Storage Server: A Black Box with No Statistics, I discussed the Affinity Card Program Test Database (ACPTD). One of the reasons I use this schema and query set is that, unlike a benchmark kit, it is not perfect. I expect Exadata to function as advertised in an imperfect world.
Imagine an organization that inherited some imperfect data with an imperfect schema. Nah, that would never happen. What about queries that aren’t perfect? Ever seen any of those? What about imperfect query plans?
One of the business questions, in English, that I test with the Affinity Card Program Test Database reads as follows:
List our customers with Club Cards who purchased more than $1,000.00 worth of goods/services in the west retail region. Omit non-affinity card purchases and all restaurant, travel and gasoline purchases.
The following text box has the SQL for this business question. Now, before you get all excited about the usage of the mcc.misc column (line 5 in the text box), please remember what I just explained in the previous paragraph. Sometimes I like imperfection when I'm analyzing performance. But what does this have to do with Exadata Storage Server? Well, I could certainly use a codes table and anti-join to eliminate the restaurant, travel and gasoline purchases (see the sketch after the text box below). No question there. However, for this particular query I wanted imperfection, not a lesson in schema design. So, the purpose behind the not like 'restaurant%travel%gasoline%' predicate is to test a more heavily weighted filtration effort in the Exadata Storage Server. It's synthetic, I know, but at least it is imperfect, which I like.
with act as (select act.card_no, act.purchase_amt
from zipcodes z, all_card_trans act, mcc m
where (act.card_no like '4777%' or act.card_no like '3333%')
and act.mcc = m.mcc
and m.misc not like 'restaurant%travel%gasoline%'
and act.merchant_zip = z.zip
and z.state in ('CA', 'OR', 'WA'))
select cf.custid, sum (act.purchase_amt) sales
from act, cust_fact cf
where act.card_no = cf.aff_cc_num
and cf.club_card_num not like '0000%'
group by cf.custid
having sum (act.purchase_amt) > 1000;
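For contrast, the cleaner design I alluded to might look roughly like the following sketch. The mcc_category codes table and its columns are hypothetical, invented here purely for illustration:

select act.card_no, act.purchase_amt
from all_card_trans act
where not exists
  (select 1
   from mcc_category mc   /* hypothetical codes table */
   where mc.mcc = act.mcc
   and mc.category in ('RESTAURANT', 'TRAVEL', 'GASOLINE'));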
Notice in the following text box how the plan shows Exadata Storage Server is filtering out all_card_trans, cust_fact, zipcodes and mcc rows. The filtering effort on mcc is not colossal, but it is certainly heftier than the filtration on zip.state, since there are synthetic data-cleanliness values in mcc.misc such as 'restaurant%travel%gasoline' and so forth.
Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter(SUM("ACT"."PURCHASE_AMT")>1000)
   7 - access("ACT"."CARD_NO"="CF"."AFF_CC_NUM")
  10 - access("ACT"."MERCHANT_ZIP"="Z"."ZIP")
  13 - access("ACT"."MCC"="M"."MCC")
  18 - storage("ACT"."CARD_NO" LIKE '4777%' OR "ACT"."CARD_NO" LIKE '3333%')
       filter("ACT"."CARD_NO" LIKE '4777%' OR "ACT"."CARD_NO" LIKE '3333%')
  23 - storage("M"."MISC" NOT LIKE 'restaurant%travel%gasoline%' AND
       SYS_OP_BLOOM_FILTER(:BF0000,"M"."MCC"))
       filter("M"."MISC" NOT LIKE 'restaurant%travel%gasoline%' AND
       SYS_OP_BLOOM_FILTER(:BF0000,"M"."MCC"))
  25 - storage("Z"."STATE"='CA' OR "Z"."STATE"='OR' OR "Z"."STATE"='WA')
       filter("Z"."STATE"='CA' OR "Z"."STATE"='OR' OR "Z"."STATE"='WA')
  27 - storage("CF"."CLUB_CARD_NUM" NOT LIKE '0000%')
       filter("CF"."CLUB_CARD_NUM" NOT LIKE '0000%')
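As an aside, the predicate section above is standard DBMS_XPLAN output. One way to pull such a report for the most recently executed statement in a session is sketched below; this is not necessarily how the listing above was captured.

select * from table(dbms_xplan.display_cursor(null, null, 'TYPICAL'));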
What Does This Have to do with Cost Based Optimizer and SQL Tuning Advisor?
The fact that Exadata does not leave any of the core Oracle Database 11g value propositions off the table is a point that has been made in the press and the blogging community over and over again. I'm sure I'll get spanked for saying this, but occasional bad plans are a fact of life with cost-based optimizers, and that includes Oracle's. Those of us willing to accept and live with this reality are very appreciative of the goodness offered by SQL Tuning Advisor. There are several canned queries in the ACPTD. The other day I loaded the scale that leaves me with an ALL_CARD_TRANS table of roughly 3TB. Of the several queries, I was disappointed with the performance of only one: the query listed above with the imperfect method for weeding out merchant codes. Since this is Exadata you can rest assured that I/O was not the problem, but the query took 721 seconds to complete! I knew something was wrong, because this query benefits from partition elimination on all_card_trans.card_no, and 721 seconds of Oracle Exadata Storage Server Smart Scan throughput on this 6-cell configuration would rack up over 4TB of total I/O (6GB/s * 721s). Oracle Database does not mistakenly read an entire table when partition pruning is possible, so what was the issue?
The following text box shows the plan. Notice the cardinality estimate in step 18. The optimizer, it seems, didn't have the best information about the relationship between ALL_CARD_TRANS and CUST_FACT.
------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name           | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |                |    15 |  1395 |   207K  (2)| 00:48:32 |        |      |            |
|   1 |  PX COORDINATOR                            |                |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)                      | :TQ10005       |    15 |  1395 |   207K  (2)| 00:48:32 |  Q1,05 | P->S | QC (RAND)  |
|*  3 |    FILTER                                  |                |       |       |            |          |  Q1,05 | PCWC |            |
|   4 |     HASH GROUP BY                          |                |    15 |  1395 |   207K  (2)| 00:48:32 |  Q1,05 | PCWP |            |
|   5 |      PX RECEIVE                            |                |    15 |  1395 |   207K  (2)| 00:48:32 |  Q1,05 | PCWP |            |
|   6 |       PX SEND HASH                         | :TQ10004       |    15 |  1395 |   207K  (2)| 00:48:32 |  Q1,04 | P->P | HASH       |
|*  7 |        HASH JOIN                           |                |    15 |  1395 |   207K  (2)| 00:48:32 |  Q1,04 | PCWP |            |
|   8 |         PX RECEIVE                         |                |     2 |   104 |   206K  (2)| 00:48:11 |  Q1,04 | PCWP |            |
|   9 |          PX SEND BROADCAST                 | :TQ10003       |     2 |   104 |   206K  (2)| 00:48:11 |  Q1,03 | P->P | BROADCAST  |
|* 10 |           HASH JOIN                        |                |     2 |   104 |   206K  (2)| 00:48:11 |  Q1,03 | PCWP |            |
|  11 |            PX RECEIVE                      |                |     2 |    88 |   206K  (2)| 00:48:11 |  Q1,03 | PCWP |            |
|  12 |             PX SEND BROADCAST              | :TQ10002       |     2 |    88 |   206K  (2)| 00:48:11 |  Q1,02 | P->P | BROADCAST  |
|* 13 |              HASH JOIN BUFFERED            |                |     2 |    88 |   206K  (2)| 00:48:11 |  Q1,02 | PCWP |            |
|  14 |               JOIN FILTER CREATE           | :BF0000        |     2 |    58 |   206K  (2)| 00:48:09 |  Q1,02 | PCWP |            |
|  15 |                PX RECEIVE                  |                |     2 |    58 |   206K  (2)| 00:48:09 |  Q1,02 | PCWP |            |
|  16 |                 PX SEND HASH               | :TQ10000       |     2 |    58 |   206K  (2)| 00:48:09 |  Q1,00 | P->P | HASH       |
|  17 |                  PX BLOCK ITERATOR         |                |     2 |    58 |   206K  (2)| 00:48:09 |  Q1,00 | PCWC |            |
|* 18 |                   TABLE ACCESS STORAGE FULL| ALL_CARD_TRANS |     2 |    58 |   206K  (2)| 00:48:09 |  Q1,00 | PCWP |            |
|  19 |               PX RECEIVE                   |                |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,02 | PCWP |            |
|  20 |                PX SEND HASH                | :TQ10001       |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,01 | P->P | HASH       |
|  21 |                 JOIN FILTER USE            | :BF0000        |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,01 | PCWP |            |
|  22 |                  PX BLOCK ITERATOR         |                |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,01 | PCWC |            |
|* 23 |                   TABLE ACCESS STORAGE FULL| MCC            |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,01 | PCWP |            |
|  24 |            PX BLOCK ITERATOR               |                |  3953 | 31624 |     3   (0)| 00:00:01 |  Q1,03 | PCWC |            |
|* 25 |             TABLE ACCESS STORAGE FULL      | ZIPCODES       |  3953 | 31624 |     3   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  26 |         PX BLOCK ITERATOR                  |                |  6278K|   245M|  1527   (1)| 00:00:22 |  Q1,04 | PCWC |            |
|* 27 |          TABLE ACCESS STORAGE FULL         | CUST_FACT      |  6278K|   245M|  1527   (1)| 00:00:22 |  Q1,04 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------------------
So, I executed the query under the SQL Tuning Advisor API as follows.
var tname varchar2(500);

begin
  :tname := dbms_sqltune.create_tuning_task(sql_text =>
   'with act as (select act.card_no, act.purchase_amt
     from zipcodes z, all_card_trans act, mcc m
     where (act.card_no like ''4777%'' or act.card_no like ''3333%'')
     and act.mcc = m.mcc
     and m.misc not like ''restaurant%travel%gasoline%''
     and act.merchant_zip = z.zip
     and z.state in (''CA'', ''OR'', ''WA''))
    select cf.custid, sum (act.purchase_amt) sales
    from act, cust_fact cf
    where act.card_no = cf.aff_cc_num
    and cf.club_card_num not like ''0000%''
    group by cf.custid
    having sum (act.purchase_amt) > 1000');
end;
/

exec dbms_sqltune.execute_tuning_task(:tname);

select dbms_sqltune.report_tuning_task(:tname) AS recommendations from dual;
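If the advisor's recommendation includes a SQL profile, accepting it takes one more call. Here is a sketch using the task name bound above; parameters other than task_name and replace are left at their defaults:

exec dbms_sqltune.accept_sql_profile(task_name => :tname, replace => TRUE);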
SQL Tuning Advisor did recommend a different plan and, as you can see in the following text box, the primary difference is in the cardinality estimate on ALL_CARD_TRANS. With this learned information the join order changed, and the result was a 6.5X speedup: the query completed in 110 seconds.
---------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name           | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                |     1 |    93 |   208K  (2)| 00:48:33 |        |      |            |
|   1 |  PX COORDINATOR                         |                |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)                   | :TQ10005       |     1 |    93 |   208K  (2)| 00:48:33 |  Q1,05 | P->S | QC (RAND)  |
|*  3 |    FILTER                               |                |       |       |            |          |  Q1,05 | PCWC |            |
|   4 |     HASH GROUP BY                       |                |     1 |    93 |   208K  (2)| 00:48:33 |  Q1,05 | PCWP |            |
|   5 |      PX RECEIVE                         |                |    86M|  7640M|   207K  (2)| 00:48:32 |  Q1,05 | PCWP |            |
|   6 |       PX SEND HASH                      | :TQ10004       |    86M|  7640M|   207K  (2)| 00:48:32 |  Q1,04 | P->P | HASH       |
|*  7 |        HASH JOIN BUFFERED               |                |    86M|  7640M|   207K  (2)| 00:48:32 |  Q1,04 | PCWP |            |
|   8 |         PX RECEIVE                      |                |  6278K|   245M|  1527   (1)| 00:00:22 |  Q1,04 | PCWP |            |
|   9 |          PX SEND HASH                   | :TQ10002       |  6278K|   245M|  1527   (1)| 00:00:22 |  Q1,02 | P->P | HASH       |
|  10 |           PX BLOCK ITERATOR             |                |  6278K|   245M|  1527   (1)| 00:00:22 |  Q1,02 | PCWC |            |
|* 11 |            TABLE ACCESS STORAGE FULL    | CUST_FACT      |  6278K|   245M|  1527   (1)| 00:00:22 |  Q1,02 | PCWP |            |
|  12 |         PX RECEIVE                      |                |  6802K|   337M|   206K  (2)| 00:48:11 |  Q1,04 | PCWP |            |
|  13 |          PX SEND HASH                   | :TQ10003       |  6802K|   337M|   206K  (2)| 00:48:11 |  Q1,03 | P->P | HASH       |
|* 14 |           HASH JOIN                     |                |  6802K|   337M|   206K  (2)| 00:48:11 |  Q1,03 | PCWP |            |
|  15 |            PX RECEIVE                   |                |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,03 | PCWP |            |
|  16 |             PX SEND BROADCAST           | :TQ10000       |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,00 | P->P | BROADCAST  |
|  17 |              PX BLOCK ITERATOR          |                |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,00 | PCWC |            |
|* 18 |               TABLE ACCESS STORAGE FULL | MCC            |   276 |  4140 |    83   (0)| 00:00:02 |  Q1,00 | PCWP |            |
|* 19 |            HASH JOIN                    |                |  7220K|   254M|   206K  (2)| 00:48:10 |  Q1,03 | PCWP |            |
|  20 |             PX RECEIVE                  |                |  3953 | 31624 |     3   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  21 |              PX SEND BROADCAST          | :TQ10001       |  3953 | 31624 |     3   (0)| 00:00:01 |  Q1,01 | P->P | BROADCAST  |
|  22 |               PX BLOCK ITERATOR         |                |  3953 | 31624 |     3   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|* 23 |                TABLE ACCESS STORAGE FULL| ZIPCODES       |  3953 | 31624 |     3   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|  24 |             PX BLOCK ITERATOR           |                |    78M|  2179M|   206K  (2)| 00:48:09 |  Q1,03 | PCWC |            |
|* 25 |              TABLE ACCESS STORAGE FULL  | ALL_CARD_TRANS |    78M|  2179M|   206K  (2)| 00:48:09 |  Q1,03 | PCWP |            |
---------------------------------------------------------------------------------------------------------------------------------------
Summary
After a brief bout with an imperfect query plan, the Oracle Database 11g SQL Tuning Advisor fixed what ailed me. With the plan offered by SQL Tuning Advisor, I was able to push my imperfect query through my imperfect schema at a perfect I/O rate of 6 GB/s for my small 6-cell Oracle Exadata Storage Server test configuration.
In spite of how many times EMC's Chuck Hollis may claim that there is "nothing new" or "no magic" when referring to Oracle Exadata Storage Server, I think it is painfully obvious that there is indeed "something new" here. Is it magic? No, and we don't claim that it's magic. What we do claim is that we have moved filtration and projection down to storage and, more importantly, we don't bottleneck disks (see Hard Drives Are Arcane Technology. So Why Can't I Realize Their Full Bandwidth Potential?). So while we don't claim Exadata is magic, there is one thing this blog post shows: all the pre-existing magic, like SQL Tuning Advisor, is still in the arsenal. When you deploy Exadata you leave nothing at the door.
Finally, a no-magic approach to something really new.
I strongly agree with your views on SANs and how the current technology has connectivity and internal-processing bottlenecks that prevent it from delivering the extreme disk I/O throughput required for large Data Warehouses. The SAN caches are not large enough to be of much help, and incorrect configuration often leads to the I/O looking like random small reads after various levels of striping, rather than efficient large sequential reads. Although SANs are good for synchronising data to a Disaster Recovery site – I'll credit them for that.
Spreading data across multiple cheap disk arrays provides scalable I/O throughput, but if you can filter the data before pulling it into the database grid, then even better. Using RDMA with Infiniband also sounds like a very efficient way of moving the data across the grid.
So, I understand that column & row filtering are performed by the Exadata storage server. My question is, how sophisticated is the current version? For example, would these cases (illustrated in the sketch after this list) be optimised for processing by the storage grid?
1. SQL Having clause
2. String/Date functions in the Where clause
3. Case Statements in the Where clause
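To make the question concrete, here are hypothetical shapes of the three cases (table t and its columns are invented purely for illustration):

-- 1. HAVING clause (the aggregation itself)
select t.region, sum(t.amt)
from t
group by t.region
having sum(t.amt) > 1000;

-- 2. String/date functions in the WHERE clause
select count(*)
from t
where upper(t.merchant_name) like 'ACME%'
and t.txn_date >= sysdate - 30;

-- 3. CASE expression in the WHERE clause
select count(*)
from t
where case when t.status = 'A' and t.amt > 0 then 1 else 0 end = 1;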
Also, are there some types of SQL statement that ideally you would like to perform data reduction for in the Exadata storage server, which are perhaps planned for a future version? e.g. Group By.
I don't want to deflect from the point you're making, Kevin, which I think is a very valid one, but the bad cardinality estimate on the card number predicate looks just like the sort of issue that can be effectively tackled by using dynamic sampling (one of my current obsessions). The feeling has been growing in me for some time that statistics on Oracle data warehouse tables are not nearly as valuable as on OLTP tables, because DW queries commonly access large tables using predicates on multiple columns (yep, 11g does multicolumn stats, I believe) or hit situations where table/partition statistics are unhelpful (e.g. pruning on multiple partitions).
Did the explain plan report that dynamic sampling was used?
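For instance, the kind of hint I have in mind looks like the following; this is only a sketch against the table from the post, and level 4 is merely illustrative (higher levels sample more blocks):

select /*+ dynamic_sampling(act 4) */ act.card_no, act.purchase_amt
from all_card_trans act
where act.card_no like '4777%'
or act.card_no like '3333%';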
Hi Kevin,
>> There is no such thing as a perfect cost-based optimizer (CBO) and the expectations placed on cost-based optimizers run rampant.
Yes!
I wish that Oracle would undertake to explain why it’s impossible for any optimizer to always choose the “best” execution plan . . .
You tell us, Don. We’re all ears.
Sorry, Kevin, didn’t mean to come across as too pessimistic in my blog.
Leaving hardware issues aside, how much of the software functionality shown here is available on generic servers, operating systems and storage that Oracle supports today? I was under the impression that most of this great stuff was native to Oracle products, and not a function of specific tin …
If the Exadata product has unique and/or specialized Oracle logic, well, that’s a different case.
Speaking strictly as a storage guy, here’s what I know.
— using commodity servers and storage arrays, we can usually feed in more data than a server can process; this is especially true in an Oracle DW environment. I'm having a hard time seeing the advantage of pairing a commodity Xeon-based server with JBOD and claiming a performance win for this part of the equation. The fixed ratio of 12 disks (6 usable) per server element strikes us as a bit wasteful.
— you may be more knowledgeable than I, but we are under the impression that the IB compute node connection doesn’t bring much to the party. When we looked at many clustered Oracle DW implementations, there was plenty of bandwidth available between the compute nodes, using multiple 1Gb/sec links.
And, I know this only matters to storage people, but there’s the minor matter of having two copies of everything, rather than the more efficient parity RAID approaches. Gets your attention when you’re talking 10-40TB usable, it does.
Bottom line — what does the hardware bring to the party, rather than software? And if you can get the same benefits without dictating that customers buy a specific piece of tin, isn’t that a win for everyone?
Finally, I’d be interested in your thoughts on how enterprise flash drives fit into all of this. Yes, they’re rather expensive now, but this won’t be the case before too long.
Thanks!
Interesting question from Don Burleson… I wish that Oracle would undertake to explain why it’s impossible for any optimizer to always choose the “best” execution plan…
It's important to remember that any SQL optimizer needs to perform its work in a very short period of time. It would be possible for the optimizer to choose the best execution plan 100% of the time if we didn't care how long it took to determine the execution plan. This is the fundamental trade-off the optimizer makes: it chooses the best plan it can quickly find.
Fortunately, Oracle has now introduced the SQL Plan Analyzer feature as part of the Real Application Testing product. SQL Plan Analyzer evaluates MANY more execution plans for the SQL because it doesn't have to worry about how long it takes to find a better plan. SQL Plan Analyzer executes in the background rather than executing within the context of a session (while the user is waiting!). Once SQL Plan Analyzer finds a better plan, it gives you the option to implement that execution plan as a stored outline.
Hope this helps…
Chris
@Chris
A Google search for “SQL Plan Analyzer” on oracle.com did not match any documents. Could you point me to the documentation for this feature?
Perhaps you mean the SQL Performance Analyzer (SPA), which uses SQL Tuning Sets (STS) and which, if a better plan is found, uses SQL Profiles, not stored outlines.